document-retrieval-system
This is a very quick post on a document retrieval system, where we see how to retrieve a set of similar documents for a given query.
❤ Problem statement ❤
Let’s say we have 1000 unlabeled documents. Our objective is to respond to a query document by returning the documents that are most similar to it (semantic-wise). This problem is similar to topic modelling.
Few Approaches:
There are some topic modelling approaches, which we will discuss later:
- Latent Semantic Analysis (LSA)
- Probabilistic-LSA (PLSA)
- Latent Dirichlet Allocation (LDA, a Bayesian version of PLSA)
Document Retrieval Approach using deep learning.
We will go through a step-by-step procedure to prepare the pipeline.
- We prepare a bag-of-words for these documents. For text cleaning, we remove punctuation, stop words and extra spaces.
- We perform stemming (removing suffixes such as -ize, -s, -es etc).
- We prepare a count vector of the top words from all documents. To use them as features in a neural network, we can prepare tfidf features. The name is an abbreviation of term frequency inverse document frequency, which is calculated as term-freq * inv-doc-freq, where
  term-freq = count-of-word / total-words
  inv-doc-freq = log(total-documents / #documents-in-which-word-appears)
- Prepare an auto-encoder model with a very small latent dimension. For example, if we select the 5000 top words, we can choose the following configuration: 5000 -> 1000 -> 200 -> 10 -> 200 -> 1000 -> 5000
  Note: our loss function is the error in reconstructing the original features.
- Now that we have latent features (10 dims), we extract these features for each document as well as for the query. The only remaining step is to find the cosine-similarity between each document and the query. If we want to categorize each document, we can do the same using similarity metrics. Note: for this we need to compare each pair of documents, which can be huge (NC2 = N(N-1)/2 combinations).
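The cleaning and tfidf steps above can be sketched in plain Python. This is only a minimal sketch under a few assumptions: the tiny STOP_WORDS set, the function names tokenize and tfidf_vectors, and the example sentences are all made up for illustration. It keeps the whole vocabulary instead of selecting top words, and it skips stemming (a real stemmer would be applied to the tokens before counting).

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "and", "for", "is", "the", "with"}


def tokenize(text):
    """Lowercase, drop punctuation and spaces, then drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(texts):
    """Return (vocabulary, one tfidf vector per text)."""
    docs = [tokenize(t) for t in texts]
    vocab = sorted({w for doc in docs for w in doc})
    # Number of documents in which each word appears.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(len(docs) / doc_freq[w]) for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        # term-freq = count-of-word / total-words in this document
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors


# Made-up example documents.
texts = [
    "The wooden chair is for a child.",
    "A wooden table and a wooden chair.",
    "The child plays with a toy.",
]
vocab, vectors = tfidf_vectors(texts)
for text, vec in zip(texts, vectors):
    print(text, [round(v, 3) for v in vec])
```

A word that appears in every document gets inv-doc-freq = log(1) = 0, so it contributes nothing to the features, which is the point of the weighting.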
That’s it. If the NN is trained well, we get documents similar to the query.
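To make the auto-encoder and the cosine-similarity step concrete, here is a heavily scaled-down sketch in plain Python. It is an illustration under stated assumptions, not the model described above: it uses a single hidden layer (6 -> 2 -> 6) instead of 5000 -> 1000 -> 200 -> 10 -> ..., has no biases, and trains with simple full-batch gradient descent. The names train_autoencoder, cosine_similarity and rank_by_similarity and the toy feature vectors are hypothetical. In practice you would build the network with a deep learning framework.

```python
import math
import random


def train_autoencoder(data, latent_dim=2, epochs=300, lr=0.05, seed=0):
    """Tiny autoencoder: tanh encoder, linear decoder, no biases.

    Minimises the mean squared reconstruction error with full-batch
    gradient descent. Returns (encode function, loss history).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(latent_dim)]
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(latent_dim)] for _ in range(d)]

    def encode(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_enc]

    def decode(z):
        return [sum(w * zi for w, zi in zip(row, z)) for row in w_dec]

    def loss():
        total = 0.0
        for x in data:
            total += sum((a - b) ** 2 for a, b in zip(decode(encode(x)), x))
        return total / n

    history = [loss()]
    for _ in range(epochs):
        g_enc = [[0.0] * d for _ in range(latent_dim)]
        g_dec = [[0.0] * latent_dim for _ in range(d)]
        for x in data:
            z = encode(x)
            x_hat = decode(z)
            # Gradient of the loss with respect to the reconstruction.
            err = [2.0 * (a - b) / n for a, b in zip(x_hat, x)]
            for i in range(d):
                for k in range(latent_dim):
                    g_dec[i][k] += err[i] * z[k]
            for k in range(latent_dim):
                # Back-propagate through the decoder and the tanh.
                dz = sum(err[i] * w_dec[i][k] for i in range(d)) * (1.0 - z[k] ** 2)
                for j in range(d):
                    g_enc[k][j] += dz * x[j]
        for k in range(latent_dim):
            for j in range(d):
                w_enc[k][j] -= lr * g_enc[k][j]
        for i in range(d):
            for k in range(latent_dim):
                w_dec[i][k] -= lr * g_dec[i][k]
        history.append(loss())
    return encode, history


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy feature vectors standing in for tfidf features (made-up numbers).
features = [
    [1.0, 0.8, 0.0, 0.0, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 1.0, 0.9, 0.0],
    [0.0, 0.1, 0.0, 0.8, 1.0, 0.0],
]
encode, history = train_autoencoder(features)
query = [1.0, 0.9, 0.0, 0.0, 0.1, 0.0]
ranking = rank_by_similarity(encode(query), [encode(x) for x in features])
print("loss before/after training:", history[0], history[-1])
print("documents ranked by similarity to the query:", ranking)
```

Only the encoder half is used at retrieval time: the decoder exists just to give the training a reconstruction target.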
Applications of information retrieval
- Better labeling of products, let’s say on Amazon, Flipkart etc. (Say we have a wooden chair for a child that is labeled as furniture. Using the above method, we can find other products similar to it. From there we can look at the history of buyers and their other products, and estimate a better label such as baby-product.)
- Discovering similar neighbourhoods (for house price estimation, violent crime forecasting etc)
- Structuring web search results (categorize the results; for example, if we search for watson, it will show ibm-watson, emma-watson or other things, so we can display these results structured by category)
- Meta feature to train another model