Large-Scale Chemical Patent Mining with UIMA and UNICORE (UIMA-HPC)

Marc Zimmermann¹, Alexander Piechot²

¹ Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, 53754, Germany
² Taros Chemicals GmbH & Co. KG, Dortmund, 44227, Germany

(Slides)

Finding information about annotated chemical reactions for drugs and small compounds is a crucial step for the pharmaceutical industry. This data is often presented in the form of unstructured documents (especially patents), and manual extraction of this information is both time-consuming and costly.

In our project UIMA-HPC [1], we describe the combined use of the Unstructured Information Management Architecture (UIMA) and the Uniform Interface to Computing Resources (UNICORE) for large-scale chemical patent mining. Our approach will incorporate existing software such as chemoCR for image processing (image-to-structure) and OCR for text reconstruction. All components are wrapped inside the UIMA framework pipeline. Using the UIMA framework ensures compatibility between the different components of the pipeline and makes it possible to plug arbitrary annotation modules into the system. Scale-out for large document collections is achieved by the UNICORE framework on high-performance computing clusters, which enables parallelization of all UIMA nodes. The aim is a fully annotated PDF collection in which all biomedical entities (compound names, reaction schemes, etc.) are connected by references and can thus be easily browsed and searched by the user.
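
The architecture described above (independent annotators sharing one document representation, with documents processed in parallel) can be illustrated by a minimal sketch. It does not use the UIMA or UNICORE APIs: the names Document, Annotation, annotate_compounds, annotate_reactions, run_pipeline and process_collection are hypothetical, the regular expressions are toy stand-ins for real annotators such as chemoCR, and a local thread pool stands in for distribution across cluster nodes.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str    # e.g. "compound" or "reaction"
    begin: int   # character offset where the annotated span starts
    end: int     # character offset where the annotated span ends
    text: str    # covered text


@dataclass
class Document:
    """Shared container that every annotator reads from and writes to."""
    doc_id: str
    text: str
    annotations: list = field(default_factory=list)


def annotate_compounds(doc):
    # Toy rule (assumption): words ending in "ine" or "ol" count as compounds.
    for m in re.finditer(r"\b[A-Za-z]+(?:ine|ol)\b", doc.text):
        doc.annotations.append(Annotation("compound", m.start(), m.end(), m.group()))
    return doc


def annotate_reactions(doc):
    # Toy rule (assumption): forms of "react" mark a reaction mention.
    for m in re.finditer(r"\breact(?:s|ed|ion)?\b", doc.text, re.IGNORECASE):
        doc.annotations.append(Annotation("reaction", m.start(), m.end(), m.group()))
    return doc


def run_pipeline(doc, annotators):
    # Each annotator only depends on the shared Document type,
    # so stages can be added or swapped without touching the others.
    for annotate in annotators:
        doc = annotate(doc)
    return doc


def process_collection(docs, annotators, workers=4):
    # Documents are independent, so the pipeline can run on each in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: run_pipeline(d, annotators), docs))


if __name__ == "__main__":
    collection = [
        Document("p1", "Caffeine reacts with ethanol."),
        Document("p2", "No chemistry here."),
    ]
    for doc in process_collection(collection, [annotate_compounds, annotate_reactions]):
        print(doc.doc_id, [(a.kind, a.text) for a in doc.annotations])
```

Because each document is processed independently, the per-document pipeline is the natural unit of work to hand to a scheduler; in this sketch that is a thread pool, whereas the project uses UNICORE to dispatch work to a cluster.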


[1] http://www.uima-hpc.org, funding: BMBF grant 01IH11012