Extracting risk information

Remove the need for manual work by automatically gathering and harmonizing text-based information.

Summary

A Safety Data Sheet (SDS) is a standardized document with which chemical manufacturers communicate a chemical’s hazard information to chemical handlers. An SDS typically contains chemical properties, health and environmental hazards, protective measures, and safety precautions for storing, handling, and transporting chemicals.

Chemical handlers usually extract information from an SDS simply by reading the section of interest. This manual workflow does not scale when the Health, Safety & Environment (HSE) manager needs to gather hazard information on all the chemicals used in the company in order to put an adequate risk management plan in place. Here we describe a KNIME workflow that extracts hazard information from thousands of SDS in an automated fashion.

Detailed description

The European Union (EU) requires that risk phrases (R-phrases), specifying danger(s), appear on each label and safety data sheet for hazardous chemicals. Safety phrases (S-phrases) for handling precautions are part of the same requirements. Currently, both risk and safety phrases are being phased out in favor of Hazard (H) Statements and Precautionary (P) Statements under the EU's implementation of the Globally Harmonized System of Classification and Labeling of Chemicals (GHS). Regulators are pushing companies to gather and harmonize this information and to make it available to anyone who has to work with a chemical substance, such as synthetic or analytical chemists.

To do this, a mixture of text mining and string manipulation was used to extract every risk phrase reported in a collection of SDS. The overall output is partitioned according to how hazardous the substances are and can then be downloaded directly by the user.
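As a rough illustration of the string-manipulation step, the sketch below pulls phrase codes out of a sentence with a regular expression. It is written in Python rather than as KNIME nodes, and the pattern is an assumption about the code shapes (three digits for H- and P-statements, one or two digits for legacy R- and S-phrases); the function name `extract_phrase_codes` is hypothetical and the expressions used in the actual workflow are not given here.

```python
import re

# Assumed code shapes (not taken from the workflow itself):
#   EUH + 3 digits, H/P + 3 digits with an optional letter suffix (e.g. H360FD),
#   R/S + 1 or 2 digits for the legacy phrases.
PHRASE_CODE = re.compile(
    r"\b(?:EUH\d{3}[A-Z]?|[HP]\d{3}[A-Za-z]{0,2}|[RS]\d{1,2})\b"
)


def extract_phrase_codes(sentence: str) -> list[str]:
    """Return the phrase codes found in a sentence, in order, without duplicates."""
    codes: list[str] = []
    for code in PHRASE_CODE.findall(sentence):
        if code not in codes:
            codes.append(code)
    return codes


if __name__ == "__main__":
    print(extract_phrase_codes("H350 May cause cancer. P201, P280; formerly R45, S53."))
```

Combined codes such as `P301+P310` are returned as two separate codes, and a compound legacy phrase such as `R36/37/38` yields only its first element, so a production pattern would need to handle those forms explicitly.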

The project started by gathering Safety Data Sheets (SDS) from as many different sources, customers, and providers as possible. A KNIME workflow was then built that can also be deployed on KNIME Server if more computational power is needed. The user uploads either a single PDF file, a library of PDF files (as a zipped archive), or a folder containing PDF files (this option is available only within KNIME Analytics Platform), as well as an Excel file listing all the requested phrases, which can be updated as needed.
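The three input modes described above can be pictured with a small Python sketch that resolves a single PDF, a zipped archive, or a folder into one list of PDF paths. This is only an analogy for the workflow's input handling: the function name `collect_pdfs` and the `extract_dir` parameter are hypothetical, and the KNIME workflow handles uploads through its own nodes.

```python
import zipfile
from pathlib import Path


def _pdfs_under(folder: Path) -> list[Path]:
    """Return all PDF files below a folder, sorted, matching the extension case-insensitively."""
    return sorted(p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf")


def collect_pdfs(source: Path, extract_dir: Path) -> list[Path]:
    """Resolve a single PDF, a zip archive, or a folder into a list of PDF paths."""
    if source.is_dir():
        return _pdfs_under(source)
    suffix = source.suffix.lower()
    if suffix == ".zip":
        # Unpack the archive into extract_dir (an assumed scratch location).
        with zipfile.ZipFile(source) as archive:
            archive.extractall(extract_dir)
        return _pdfs_under(extract_dir)
    if suffix == ".pdf":
        return [source]
    raise ValueError(f"Unsupported input: {source}")
```

Normalizing every input mode to the same list means the downstream parsing steps do not need to know how the files were supplied.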

Text mining nodes are applied to the output of the Tika Parser in order to extract all the sentences in each file. Each sentence is then analyzed with string or regex manipulation to find the CAS number, the product name, and all risk phrases. Because the input files vary widely (document date, language, producer), a try/catch construct is needed to keep the workflow from being interrupted. The results report the file name, the product name, all the CAS numbers retrieved in each document, and all the retrieved phrases that match the user-defined list. In line with the regulation on very dangerous substances, SDS reporting codes for mutagenic or carcinogenic hazards are saved in a second Excel file.
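The error handling and the split into two outputs can be sketched as follows, again in Python rather than as KNIME nodes. The set `CMR_CODES` is an assumption: it lists a few well-known H-statements and R-phrases for mutagenic and carcinogenic hazards, not the list actually used in the workflow. The name `process_all` and the `extract` callback are hypothetical, and the sketch assumes flagged documents go only to the second output.

```python
from typing import Callable, Iterable

# Assumed codes for mutagenic or carcinogenic hazards (illustrative only).
CMR_CODES = {"H340", "H341", "H350", "H351", "R45", "R46", "R49"}

Row = tuple[str, list[str]]


def process_all(
    files: Iterable[str], extract: Callable[[str], list[str]]
) -> tuple[list[Row], list[Row], list[tuple[str, str]]]:
    """Run `extract` on every file; a failing file is recorded instead of stopping the batch."""
    regular: list[Row] = []
    flagged: list[Row] = []
    failed: list[tuple[str, str]] = []
    for name in files:
        try:
            codes = extract(name)
        except Exception as exc:  # catch-all, mirroring the try/catch construct
            failed.append((name, str(exc)))
            continue
        # Documents with at least one CMR code go to the second output.
        target = flagged if CMR_CODES.intersection(codes) else regular
        target.append((name, codes))
    return regular, flagged, failed
```

Keeping a separate list of failed files makes it possible to review problematic documents afterwards without rerunning the whole batch.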

Achievable results

Using KNIME Analytics Platform, the recovery of risk phrases (comprising all R- and S-phrases and H- and P-statements) was fully accomplished. A few thousand PDF files, covering all the SDS present in a medium-size company, were parsed in less than an hour. Other notable results include:

  • Significant time savings on repetitive operations (from about two minutes to a few seconds for each SDS)
  • Useful both for batch processing of SDS files and for a single (new) SDS file
  • Avoidance of deprecated terms, because the risk phrase list is updated through an Excel file

KNIME software used to achieve result

The open source KNIME Analytics Platform not only makes this task faster, but also reduces the risk of human error.

Specifically, three features contributed significantly to the success of this project. Firstly, the Tika Parser node played a fundamental role by enabling the retrieval of meta information from each file. Secondly, the try/catch construct was an effective way to keep errors from stopping the workflow. Thirdly, regex code in a Java Snippet node was used to isolate CAS numbers from the PDFs.
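To give an idea of what isolating CAS numbers with a regex involves, here is a minimal sketch. It is written in Python, not in the Java used in the workflow's snippet, and the original expression is not reproduced here. It relies on the standard CAS Registry Number format (two to seven digits, two digits, one check digit, separated by hyphens) and on the standard check-digit rule; the function names are hypothetical.

```python
import re

# Standard CAS Registry Number shape: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def has_valid_check_digit(first: str, second: str, check: str) -> bool:
    """Check digit = sum of (digit * position counted from the right) modulo 10."""
    digits = first + second
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(check)


def find_cas_numbers(text: str) -> list[str]:
    """Return the distinct, checksum-valid CAS numbers in order of appearance."""
    found: list[str] = []
    for match in CAS_PATTERN.finditer(text):
        if has_valid_check_digit(*match.groups()) and match.group(0) not in found:
            found.append(match.group(0))
    return found


if __name__ == "__main__":
    sample = "Benzene (CAS 71-43-2), water 7732-18-5, revised 2020-01-15, typo 71-43-3."
    print(find_cas_numbers(sample))
```

Validating the check digit filters out many strings that merely look like CAS numbers, such as mistyped identifiers, while the word boundaries keep ISO-style dates from matching.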

KNIME output / graphic / interface

Graphical output of the workflow. The table shows the filename of the SDS, the product name, all the retrieved CAS numbers and all the phrases contained in the document.

Latest news

Workshop at the 63rd AFI Symposium (Simposio AFI), 5 June 2024 - Rimini

On Wednesday, 5 June, we invite you to attend the workshop:

PRODUCTION PROCESSES:
INNOVATION IN THE IDENTIFICATION AND EVALUATION OF IMPURITIES AND IN INTEGRATED INFORMATION MANAGEMENT

We will be pleased to meet you at stand no. 5 in the exhibition area.

Meet TOXIT at SITOX 2023!

TOXIT is excited to join the 21st National Congress of Toxicology organised by the Italian Society of Toxicology (SITOX)! We will be in Bologna from the 20th to the 22nd of February. We are also glad to share that Dr. Marta Lettieri will be presenting a poster titled “Skin sensitization: AOPs and (Q)SAR models” (P5/1).

Looking forward to meeting you there!

Machine learning in early prediction of metabolism of drugs for rare diseases

We are proud to announce that our colleague Marta Lettieri will be speaking at the workshop "Modelling & Simulation: Research Methodologies for Small Populations in Rare Diseases", which will be held in Bari on July 4th-5th.

S-IN Soluzioni Informatiche S.r.l. - via Malta 6/B - 25124 - Brescia - Italy

E-mail: info@s-in.it - PEC: mail@pec.s-in.it - Tel. +39 0444 1821160 - Fax +39 0444 1821169

Tax code & VAT no. IT02397280245 - Registry Office: Vicenza - REA: BS-624000 - Share capital € 15,000.00 fully paid up