Extracting Desired Information from Unstructured Text Using Deep Learning Architectures


Named-entity recognition (NER) is a sub-field of information retrieval that automatically extracts names of people, locations, organisations, and other entities from unstructured data. With text data growing every second, NER systems help businesses carry out some of their most important text-processing tasks.

The main problem with this available text data is its unstructured format, which is not ready for consumption. Even though businesses need the data in a structured format, process or access constraints often mean it is only available as unstructured text. Getting this data into a structured format by hand costs teams a substantial amount of money and time, so automating the conversion would be of real help to businesses. Even without 100% accuracy, it would still significantly reduce man-hours.

Bhargavi Eruva and Ramesh Melapu, Data Scientists at INSOFE, aimed to build a deep learning system that automates the extraction of required entities from unstructured data. These entities go beyond people, locations, etc. to any important information the user wants to extract from the text.

Application Areas:

  • Content Classification for Media & Entertainment Industries.
  • Information extraction for further analysis in HR Consulting firms, Financial Institutions, Chemical Industries, etc.
  • Recommendation engines powered by content extraction.
  • Search Engines.
  • Customer support systems.

Implementation Details:

  1. Data

  • Key information extraction from 381 resumes, available on dataturks.com.
  • Trade details extraction from 543 trade files, available in PDF format.
  • Trade files were tagged for the required entities using the Prodigy tool, whereas tagged resumes were already available.

  2. Design & Constraints

In general, NER problems are solved using sequence-to-sequence prediction architectures. In line with the literature, we developed LSTM architectures to solve the problem. However, LSTM architectures on text data face the following problems:

  • Fixed vocabulary size: A neural network for text learns on a fixed set of words during the training phase, so new words in the test data are essentially ignored. At the same time, the probability of new words is high in the type of data discussed above: trade dates or graduation years, for example, appear in countless combinations, and it is simply impossible to enumerate them all in the training vocabulary for the model to learn.

  • Fixed sequence length: The number of words in each data point is fixed in order to train the network, but a typical resume ranges between 300 and 1,000 words. This large range of lengths increases the number of weights to learn, which is not very efficient.

  • Transferable architectures: At the end of this study, we want to create architectures that can be transferred across different domains, achieving reasonable metrics by tuning very few parameters.

  3. Experiments

Four different architectures are discussed in this study. A few of them help tackle the problems above.

 Basic LSTM with fixed vocabulary: 

This model consists of three layers.

  • Word Embedding Layer: Maps each word to a vector space using an embedding model.
  • Modelling Layer: A Long Short-Term Memory Network (LSTM) is used on top of the embeddings provided by the previous layers to understand the context.
  • Output Layer: Provides a sequence of tags, one for each word in the input data.

Though the basic LSTM does not handle the problems discussed above, it gives a baseline picture of model performance, and all the other architectures are tuned to get close to it.
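To make the layer descriptions concrete, here is a minimal Keras sketch of this baseline. The vocabulary size, sequence length, embedding width, and tag count are illustrative assumptions, not the values used in the study.

from tensorflow.keras.layers import Dense, Embedding, Input, LSTM, TimeDistributed
from tensorflow.keras.models import Model

VOCAB_SIZE = 20000   # fixed vocabulary learned from the training data (assumed)
MAX_LEN = 500        # fixed sequence length after padding/truncation (assumed)
EMBED_DIM = 100      # width of each word vector (assumed)
NUM_TAGS = 14        # entity tags plus a "no entity" tag (assumed)

# Word Embedding Layer: maps each word index to a dense vector.
words = Input(shape=(MAX_LEN,), dtype="int32")
embedded = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM)(words)

# Modelling Layer: an LSTM reads the embedded sequence to capture context.
context = LSTM(128, return_sequences=True)(embedded)

# Output Layer: one tag distribution per input word.
tags = TimeDistributed(Dense(NUM_TAGS, activation="softmax"))(context)

model = Model(words, tags)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])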

LSTM with Character level embeddings: 

This architecture tackles the fixed-vocabulary problem by embedding each character in a word instead of the word itself. A word's suffix or prefix often says a lot about its meaning, and this information is especially valuable for texts such as molecular-engineering research papers that contain rare words at inference time. This network consists of four layers.

  • Character Embedding Layer: Maps each character of every word to a vector space using a time-distributed embedding model.
  • Word Embedding Layer: A two-layer Convolutional Neural Network (CNN) is used to build the word representation from the character embeddings.
  • Modelling Layer: A Long Short-Term Memory Network (LSTM) is used on top of the embeddings provided by the previous layers to understand the context.
  • Output Layer: Provides a sequence of tags, one for each word in the input data.
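One way to realise this four-layer network in the same Keras style, again with assumed sizes for the character inventory, word length, and convolution filters:

from tensorflow.keras.layers import (Conv1D, Dense, Embedding, GlobalMaxPooling1D,
                                     Input, LSTM, TimeDistributed)
from tensorflow.keras.models import Model, Sequential

NUM_CHARS = 100      # size of the character inventory (assumed)
MAX_LEN = 500        # words per document (assumed)
MAX_WORD_LEN = 20    # characters per word after padding/truncation (assumed)
NUM_TAGS = 14        # entity tags plus a "no entity" tag (assumed)

# Character Embedding Layer + Word Embedding Layer: a two-layer CNN turns the
# character embeddings of one word into a single word vector.
char_encoder = Sequential([
    Input(shape=(MAX_WORD_LEN,), dtype="int32"),
    Embedding(input_dim=NUM_CHARS, output_dim=30),
    Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    GlobalMaxPooling1D(),
])

# The encoder is applied to every word position via TimeDistributed.
chars = Input(shape=(MAX_LEN, MAX_WORD_LEN), dtype="int32")
word_vectors = TimeDistributed(char_encoder)(chars)

# Modelling Layer and Output Layer are the same as in the basic LSTM.
context = LSTM(128, return_sequences=True)(word_vectors)
tags = TimeDistributed(Dense(NUM_TAGS, activation="softmax"))(context)

model = Model(chars, tags)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])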

LSTM with Bloom Embeddings: 

While we learn word representations using the embedding layer, every word in the vocabulary has to be one-hot encoded. This is the limitation that prevents us from expanding the vocabulary size, and spaCy handles it gracefully using Bloom embeddings.

Bloom embeddings learn word representations from dense hashed encodings instead of sparse one-hot encodings: each word's representation is the vector sum of four vectors retrieved via hash-embedded keys. Our Bloom embeddings follow the explanation in this post: https://support.prodi.gy/t/can-you-explain-how-exactly-hashembed-works/564

  • Bloom Embedding Layer: Maps each word to a vector space using a time-distributed dense model.
  • Modelling Layer: A Long Short-Term Memory Network (LSTM) is used on top of the embeddings provided by the previous layer to understand the context.
  • Output Layer: Provides a sequence of tags, one for each word in the input data.
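A rough sketch of the idea behind such an embedding layer, following the post linked above: each word is hashed with a few different seeds into a small table and the retrieved rows are summed, so unseen words still map to usable vectors. The table size, number of hashes, and hashing scheme below are illustrative assumptions, not spaCy's exact implementation.

import zlib
import numpy as np
from tensorflow.keras.layers import Add, Dense, Embedding, Input, LSTM, TimeDistributed
from tensorflow.keras.models import Model

NUM_ROWS = 5000      # rows in the compact table, far fewer than the full vocabulary (assumed)
NUM_HASHES = 4       # each word is the vector sum of four hashed rows
EMBED_DIM = 100      # width of each word vector (assumed)
MAX_LEN = 500        # words per document after padding (assumed)
NUM_TAGS = 14        # entity tags plus a "no entity" tag (assumed)

def bloom_ids(tokens, seed, num_rows=NUM_ROWS):
    """Stable, seeded hash of each token into the compact table, so even words
    never seen during training still receive valid row indices."""
    return np.array([zlib.crc32(f"{seed}:{tok}".encode()) % num_rows
                     for tok in tokens], dtype="int64")

# One input per hash seed; feed [bloom_ids(padded_tokens, s) for s in range(NUM_HASHES)].
hash_inputs = [Input(shape=(MAX_LEN,), dtype="int64", name=f"hash_{i}")
               for i in range(NUM_HASHES)]

# Bloom Embedding Layer: a single shared table, with the rows retrieved by the
# different hashes summed into one dense word representation.
table = Embedding(input_dim=NUM_ROWS, output_dim=EMBED_DIM)
word_vectors = Add()([table(h) for h in hash_inputs])

# Modelling Layer and Output Layer as before.
context = LSTM(128, return_sequences=True)(word_vectors)
tags = TimeDistributed(Dense(NUM_TAGS, activation="softmax"))(context)

model = Model(hash_inputs, tags)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])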

CNN classification with Bloom Embeddings: 

This architecture tackles the sequence-length problem. Since context depends on nearby words, the entity tag of a word can be predicted from a window of words to its left and right. By treating each word as an individual case, we eliminate the need for a maximum sequence length.

  • Bloom Embedding Layer: Maps each word to a vector space using a time-distributed dense model.
  • Modelling Layer: A Convolutional Neural Network (CNN) is used on top of the embeddings provided by the previous layer to understand the context of the nearby words.
  • Output Layer: Provides a tag for the target word.
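A sketch of this window classifier under the same assumptions (window size and filter counts are illustrative); each training case is now a single word plus its neighbours, so no document-level maximum length is needed.

from tensorflow.keras.layers import Add, Conv1D, Dense, Embedding, Flatten, Input
from tensorflow.keras.models import Model

WINDOW = 5           # words taken on each side of the target word (assumed)
NUM_ROWS = 5000      # compact Bloom embedding table, as in the previous sketch
NUM_HASHES = 4       # hash seeds per word
EMBED_DIM = 100      # width of each word vector (assumed)
NUM_TAGS = 14        # entity tags plus a "no entity" tag (assumed)

# Input: one case per word, holding the hashed ids of the 2 * WINDOW + 1
# words centred on the target word (one input per hash seed).
hash_inputs = [Input(shape=(2 * WINDOW + 1,), dtype="int64", name=f"hash_{i}")
               for i in range(NUM_HASHES)]

# Bloom Embedding Layer: shared table, rows summed across the hash seeds.
table = Embedding(input_dim=NUM_ROWS, output_dim=EMBED_DIM)
window_vectors = Add()([table(h) for h in hash_inputs])

# Modelling Layer: a CNN over the window captures the local context.
context = Conv1D(128, kernel_size=3, padding="same", activation="relu")(window_vectors)
context = Flatten()(context)

# Output Layer: a single tag for the centre word.
tag = Dense(NUM_TAGS, activation="softmax")(context)

model = Model(hash_inputs, tag)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])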

  4. Results

Results of the above architectures on the two datasets are as follows:

  • Trade Files Entity Prediction: Thirteen entities, such as trade date, maturity date, options, etc., from the trade files are of interest.

Loss Metrics

Model                                      Train Loss   Test Loss   Train Accuracy   Test Accuracy
LSTM                                       0.0015       0.0451      0.9997           0.99
Character embedding using CNN and LSTM     0.0334       0.0293      0.9907           0.9923
LSTM with Bloom Embeddings                 0.0166       0.0169      0.9956           0.9955
CNN Classification with Bloom Embeddings   0.0257       0.037       0.992            0.9924
  • Resume Entity Prediction: Twenty-one entities, such as designation, year of graduation, experience details, skills, etc., from the resumes are of interest.

Model                                      Train Loss   Test Loss   Train Accuracy   Test Accuracy
LSTM                                       0.0136       0.2765      0.9959           0.9564
Character embedding using CNN and LSTM     0.1298       0.1605      0.9608           0.9519
LSTM with Bloom Embeddings                 0.0955       0.1424      0.9700           0.9581
CNN Classification with Bloom Embeddings   0.24         0.38        0.921            0.9

  5. Deployment Framework

We used the Flask framework to build a simple app that loads the Keras model and deploys it as a REST API.

In order to successfully deploy,

  1. Install and configure Keras on the machine where the application is deployed.
  2. Install Flask, the Python web framework used to build the API endpoint.
  3. Build the Keras REST API with two parts: load_model, which loads the already trained Keras model and prepares it for inference, and predict, the actual API endpoint that classifies the incoming data from the request and returns the results to the client (a minimal sketch is shown after this list).
  4. Start the Keras REST API service by running the respective Python application.
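The real app1.py used in the project is not reproduced here; the following is a minimal sketch of such a Flask app, in which the model path, the /predict route, and the preprocess() helper are illustrative assumptions.

import numpy as np
from flask import Flask, jsonify, request
from tensorflow.keras.models import load_model

app = Flask(__name__)

# Load the already trained Keras model once at start-up so every request
# reuses it for inference (the path is an assumption).
model = load_model("models/basic_lstm.h5")

def preprocess(text, max_len=500):
    # Stand-in for the project's own tokenisation and indexing code: hash
    # whitespace tokens into a fixed range and pad to the model's length.
    ids = [hash(tok) % 20000 for tok in text.split()][:max_len]
    ids += [0] * (max_len - len(ids))
    return np.array([ids])

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded document, run the model, and return one tag per word.
    text = request.files["document"].read().decode("utf-8", errors="ignore")
    scores = model.predict(preprocess(text))
    tags = np.argmax(scores, axis=-1).tolist()
    return jsonify({"tags": tags})

if __name__ == "__main__":
    # Port 5001 matches the example terminal session below.
    app.run(host="127.0.0.1", port=5001, debug=True)

With the service running, the /predict route in this sketch could be exercised with, for example, curl -X POST -F document=@resume.txt http://127.0.0.1:5001/predict.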

Ex. Open the terminal and execute:

F:\GEM\UI>python app1.py
Using TensorFlow backend.
* Debugger is active!
* Debugger PIN: 204-558-283
* Running on http://127.0.0.1:5001/ (Press CTRL+C to quit)

Example 1:

Step 1: Open the terminal and execute the Python application as above; a URL will be displayed. Copy and paste the IP address and port into your browser to access the server, and the upload page is displayed.

Step 2: Select the appropriate options for document type and model, browse to the document for which predictions have to be made, and click on UPLOAD.

The models are trained on two different document types (financial trade documents and resumes) to evaluate model performance across domains.

Four different models are built:

  1. Basic LSTM
  2. CHAR LSTM using CNN
  3. LSTM using Bloom embeddings
  4. LSTM using Bloom embeddings and CNN for classification

This example uses resume as the document type and BASIC_LSTM as the model, with an uploaded resume that was not used as part of the training data. The entities tagged for this resume are then displayed.

Further reformatting has to be done on top of these named-entity predictions to get the data into a proper tabular format.

We hope this article helps you.
Any questions, feedback, or suggestions for improvement are most welcome. 🙂

