Extracting risk information
Remove the need for manual work by automatically gathering and harmonizing text-based information.
Summary
A Safety Data Sheet (SDS) is a standardized document by which chemical manufacturers communicate a chemical’s hazard information to chemical handlers. An SDS typically contains chemical properties, health and environmental hazards, protective measures, and safety precautions for storing, handling, and transporting chemicals.
Chemical handlers usually extract information from an SDS by simply reading the section of interest. This “manual” workflow does not scale, however, when the Health, Safety & Environment (HSE) manager wants to gather hazard information about all chemicals used in the company in order to put an adequate risk management plan in place. Here we describe a KNIME workflow that extracts hazard information from thousands of SDS in an automated fashion.
Detailed description
The European Union (EU) requires that risk phrases (R-phrases), specifying danger(s), appear on each label and safety data sheet for hazardous chemicals. Safety phrases (S-phrases) for handling precautions are part of the same requirements. Both risk and safety phrases are currently being phased out in favor of Hazard (H) Statements and Precautionary (P) Statements under the EU's implementation of the Globally Harmonized System of Classification and Labeling of Chemicals (GHS). Regulators are pushing companies to gather and harmonize this information and to make it available to anyone who has to work with a chemical substance, such as synthetic or analytical chemists.
To do this, a mixture of text mining and string manipulation was used to extract every risk phrase reported in a collection of SDS. The overall output is partitioned according to how dangerous the substances are and can then be downloaded directly by the user.
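The partitioning step can be pictured with a small Python sketch. It is only an illustration of the idea, not the workflow itself, which is built from KNIME nodes. The record layout, the function name `partition_by_hazard`, and the contents of `CMR_CODES` are assumptions made for this example; in practice the relevant codes would come from the applicable regulation and the user's own phrase list.

```python
# Illustrative (assumed) set of codes that flag mutagenic or carcinogenic hazards.
# H340/H341: germ cell mutagenicity, H350/H351: carcinogenicity,
# R45/R46/R49: the corresponding older R-phrases.
CMR_CODES = {"H340", "H341", "H350", "H351", "R45", "R46", "R49"}


def partition_by_hazard(records):
    """Split records into (flagged, other) depending on their phrase codes.

    Each record is assumed to be a dict with a "phrases" list of code strings.
    """
    flagged, other = [], []
    for record in records:
        # A record is flagged if it shares at least one code with CMR_CODES.
        target = flagged if CMR_CODES & set(record["phrases"]) else other
        target.append(record)
    return flagged, other
```

Each of the two lists could then be written to its own output file.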
The project started by gathering SDS from as many different sources, customers, and providers as possible. A KNIME workflow was then built that can also be deployed on KNIME Server if more computational power is needed. The user uploads either a single PDF file, a library of PDF files (as a zipped archive), or a folder containing PDFs (the folder option is available only within KNIME Analytics Platform), together with an Excel file listing all the requested phrases to be updated.
Text mining nodes are applied to the output of the Tika Parser to extract all the sentences in each file. Each sentence is then analyzed with string or regex manipulation to find the CAS number, the product name, and all risk phrases. Because the input files vary widely (document date, language, producer), a try/catch construct is necessary to keep the workflow from being interrupted. The results report the file name, the product name, all the CAS numbers retrieved in each document, and all the phrases retrieved, which are matched against the user-defined list. In accordance with the regulation on very dangerous substances, SDS reporting codes for mutagenic or carcinogenic hazards are saved in a second Excel file.
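To make the sentence-level search more concrete, here is a minimal Python sketch of how CAS numbers and phrase codes might be pulled from a sentence with regular expressions. It is not the code used in the workflow: the patterns are simplified, the function names are hypothetical, and the check-digit validation is an extra step added here to filter out look-alike strings such as batch numbers. The `allowed` set stands in for the user-supplied phrase list.

```python
import re

# CAS Registry Number: 2-7 digits, 2 digits, 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

# Simplified pattern for candidate codes such as H350, P280, R45 or S53.
# Combined codes like P301+P310 are returned as separate codes.
CODE_PATTERN = re.compile(r"\b(?:H\d{3}[A-Za-z]{0,2}|P\d{3}|[RS]\d{1,2})\b")


def is_valid_cas(body, middle, check):
    """Verify the CAS check digit: weighted digit sum (from the right) modulo 10."""
    digits = (body + middle)[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def extract_cas_numbers(sentence):
    """Return all well-formed CAS numbers with a correct check digit."""
    return [
        f"{body}-{middle}-{check}"
        for body, middle, check in CAS_PATTERN.findall(sentence)
        if is_valid_cas(body, middle, check)
    ]


def extract_phrase_codes(sentence, allowed):
    """Return candidate codes that also appear in the user-defined list."""
    return [code for code in CODE_PATTERN.findall(sentence) if code in allowed]
```

For example, `extract_cas_numbers("Formaldehyde, CAS No. 50-00-0")` keeps `50-00-0` because its check digit is consistent, while a string such as `12-34-5` is discarded.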
Achievable results
Using KNIME Analytics Platform, the recovery of risk phrases (comprising all R- and S-phrases and H- and P-statements) was fully accomplished. A few thousand PDF files, covering all the SDS present in a medium-sized company, were parsed in less than an hour. Other notable results include:
- Significant time savings on repetitive operations (from about two minutes to a few seconds for each SDS)
- Useful both for batch processing of SDS files and for a single (new) SDS file
- Avoidance of deprecated terms, because the risk phrase list is updated through an Excel file
KNIME software used to achieve result
The open source KNIME Analytics Platform not only makes this task faster but also reduces the risk of human error.
Specifically, three features contributed significantly to the success of this project. Firstly, the Tika Parser node played a fundamental role by enabling the retrieval of meta information from each file. Secondly, the try/catch construct was an effective way to prevent errors in individual files from stopping the workflow. Thirdly, regex code in a Java Snippet node was used to isolate CAS numbers from the PDFs.
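The role of the try/catch construct can be illustrated outside KNIME with a short Python sketch. It assumes a hypothetical `extract` callable that turns the text of one document into a dictionary of results and may raise an exception on unexpected input. The helper name `process_batch` and the result layout are likewise assumptions made for this example.

```python
def process_batch(documents, extract):
    """Apply `extract` to each (name, text) pair without stopping on errors.

    Returns the successful results and a list of (name, error message) pairs
    for the documents that could not be processed.
    """
    results, failures = [], []
    for name, text in documents:
        try:
            results.append({"file": name, **extract(text)})
        except Exception as error:  # deliberately broad: inputs vary widely
            failures.append((name, str(error)))
    return results, failures
```

Collecting the failures instead of discarding them makes it possible to review problematic files after the batch has finished.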
KNIME output / graphic / interface
Graphical output of the workflow. The table shows the filename of the SDS, the product name, all the retrieved CAS numbers and all the phrases contained in the document.