General & NLP Terminology Interview Questions
1. What is Natural Language Processing?
Natural Language Processing (NLP) is a field of computer science that deals with communication between computer systems and humans. It draws on techniques from Artificial Intelligence and Machine Learning.
2. What are stop words?
Stop words are words that carry little useful information for a search engine. Articles, prepositions, and similar function words, such as was, were, is, am, the, a, an, how, and why, are typically treated as stop words. In Natural Language Processing, we eliminate stop words so that we can better understand and analyze the meaning of a sentence.
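For illustration, here is a minimal sketch of stop-word removal using NLTK's built-in English stop-word list (the example sentence is made up):

# Minimal sketch: removing stop words with NLTK's English stop-word list.
# Requires one-time downloads of the 'stopwords' and 'punkt' resources.
import nltk
nltk.download("stopwords")
nltk.download("punkt")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

sentence = "This is an example showing how the stop words are removed"
tokens = word_tokenize(sentence)

# Keep only the tokens that are not in the stop-word list.
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # e.g. ['example', 'showing', 'stop', 'words', 'removed']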
3. What is NLTK?
NLTK is a Python library, which stands for Natural Language Toolkit. We use NLTK to process data written in natural (human) languages. NLTK allows us to apply techniques such as parsing, tokenization, lemmatization, stemming, and more to understand natural language.
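A minimal sketch of sentence- and word-level tokenization with NLTK (the text is made up):

# Minimal sketch: tokenization with NLTK at the sentence and word level.
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK is a Python library. It helps machines process human language."
print(sent_tokenize(text))  # ['NLTK is a Python library.', 'It helps machines ...']
print(word_tokenize(text))  # ['NLTK', 'is', 'a', 'Python', 'library', '.', ...]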
4. What is Syntactic Analysis?
Syntactic analysis is a technique for analyzing sentences to extract meaning from them. Using syntactic analysis, a machine can analyze and understand the order of words arranged in a sentence. It involves the following operations:
Parsing: It helps in deciding the structure of a sentence or text in a document. It helps analyze the words in the text based on the grammar of the language.
Word segmentation: The segmentation of words segregates the text into small significant units.
Morphological segmentation: The purpose of morphological segmentation is to break words into their smallest meaningful units, called morphemes.
Stemming: It is the process of removing the suffix from a word to obtain its root word.
Lemmatization: It reduces a word to its dictionary base form (lemma) while preserving the word's meaning, often using its part of speech (see the sketch after this list).
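As referenced above, here is a minimal sketch contrasting stemming and lemmatization with NLTK (the words are illustrative):

# Minimal sketch: stemming vs. lemmatization with NLTK.
import nltk
nltk.download("wordnet")  # one-time download for the WordNet lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops off suffixes by rule; the result may not be a real word.
print(stemmer.stem("studies"))                   # 'studi'
# Lemmatization maps the word to its dictionary form (lemma).
print(lemmatizer.lemmatize("studies"))           # 'study'
# Passing the part of speech ('v' = verb) can change the lemma.
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'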
5. What is Semantic Analysis?
Semantic analysis helps make a machine understand the meaning of a text. It uses various algorithms for the interpretation of words in sentences. It also helps understand the structure of a sentence.
6. Named entity recognition:
This is an information extraction task that identifies entities such as the names of persons, organizations, places, times, emotions, etc.
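A minimal sketch using NLTK's ne_chunk (the sample sentence is made up, and resource names may vary slightly across NLTK versions; spaCy is another common choice):

# Minimal sketch: named entity recognition with NLTK's ne_chunk.
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)  # one-time downloads

sentence = "Tim Cook visited Paris in October to meet the Apple team."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # POS tags are needed before chunking
tree = nltk.ne_chunk(tagged)   # labels spans such as PERSON, GPE, ORGANIZATION

for subtree in tree:
    if hasattr(subtree, "label"):  # entity subtrees carry a label
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)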
7. What is TF-IDF?
TF-IDF, or Term Frequency-Inverse Document Frequency, indicates the importance of a word in a document relative to a collection of documents. It is a numerical statistic used in information retrieval. For a specific document, TF-IDF scores help identify the keywords of that document. The major use of TF-IDF in NLP is the extraction of useful information from documents by means of statistical data; it is commonly used to classify and summarize the text in documents and to filter out stop words.
TF is the ratio of the frequency of a term in a document to the total number of terms in that document, whereas IDF denotes how rare, and therefore how informative, the term is across the whole collection.
The formula for calculating TF-IDF:
TF(W) = (Frequency of W in a document)/(The total number of terms in the document)
IDF(W) = log_e(The total number of documents/The number of documents having the term W)
A high TF-IDF score means the term appears frequently in the given document but rarely in the rest of the collection; such terms make good keywords for that document.
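As a sketch, the formulas above can be implemented directly in Python; the three toy documents below are made up for illustration:

# Minimal sketch: computing TF-IDF by hand using the formulas above.
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(word, doc):
    # Frequency of the word in the document / total terms in the document.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log_e(total documents / documents containing the word).
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'cat' appears in two of the three documents, so its IDF is low;
# 'mat' appears in only one, so it scores higher as a keyword.
print(tf_idf("cat", docs[0], docs))  # ~0.0676
print(tf_idf("mat", docs[0], docs))  # ~0.1831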
8. What is Pragmatic Analysis?
Pragmatic analysis is an important task in NLP for interpreting knowledge that lies outside a given document. It is concerned with outside-world knowledge, that is, information not contained in the documents and/or questions themselves, and it focuses on what was actually meant rather than only on what was literally said. This requires comprehensive knowledge of the real world. Pragmatic analysis allows software applications to critically interpret real-world data and determine the actual meaning of sentences and words.
Example:
Consider this sentence: ‘Do you know what time it is?’
This sentence can be a genuine question about the time, or a pointed remark to someone who is late; which one it is depends on the context in which the sentence is used.
9. What is Pragmatic Ambiguity?
Pragmatic ambiguity refers to a word or sentence having multiple possible interpretations. Ambiguity arises when the meaning of a sentence is not clear because its words can be understood in different ways. In practical situations, this makes it challenging for a machine to determine the intended meaning of a sentence.
Example:
Check out the below sentence.
‘Are you feeling hungry?’
The given sentence could be either a straightforward question or a polite way of offering food.
10. What are unigrams, bigrams, trigrams, and n-grams in NLP?
An n-gram is a contiguous sequence of n words taken from a text. When we process a sentence one word at a time, each word is a unigram; two consecutive words form a bigram; three consecutive words form a trigram; and, in general, n consecutive words form an n-gram.
Example: For the sentence 'I love natural language processing', the unigrams are 'I', 'love', 'natural', 'language', and 'processing'; the bigrams include 'I love' and 'love natural'; and the trigrams include 'I love natural'.
Therefore, n-grams allow machines to understand a word in its local context. N-gram models also help predict the next word and correct spelling errors.
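A minimal sketch of generating n-grams in plain Python (the sentence is made up):

# Minimal sketch: generating unigrams, bigrams, and trigrams from a sentence.
def ngrams(words, n):
    # Slide a window of size n over the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "I love natural language processing".split()
print(ngrams(words, 1))  # unigrams: ('I',), ('love',), ...
print(ngrams(words, 2))  # bigrams: ('I', 'love'), ('love', 'natural'), ...
print(ngrams(words, 3))  # trigrams: ('I', 'love', 'natural'), ...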
11. What is Feature Extraction in NLP?
Features or characteristics of a word help in text or document analysis, as well as in sentiment analysis of a text. Feature extraction is one of the techniques used by recommendation systems. For example, reviews such as 'excellent,' 'good,' or 'great' for a movie are recognized as positive reviews by a recommender system. The system also tries to identify features of the text that help describe the context of a word or a sentence, and then groups words that share common characteristics. Whenever a new word arrives, the system categorizes it according to the labels of these groups.
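As one illustration, a bag-of-words feature extractor such as scikit-learn's CountVectorizer (one common choice among many) can turn text into numeric features; the reviews below are made up:

# Minimal sketch: turning text into numeric features with a bag-of-words model.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "excellent movie, great acting",
    "good story and great visuals",
    "boring plot, bad acting",
]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(reviews)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(features.toarray())                  # word counts per review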
12. What is POS Tagging?
Parts-of-speech (POS) tagging assigns a tag, such as noun, adjective, or verb, to each word in a text. The software first reads the text and then uses tagging algorithms to differentiate the words by their parts of speech. POS tagging is one of the most essential tools in Natural Language Processing, as it helps the machine understand the meaning of a sentence.
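A minimal sketch with NLTK's built-in tagger (the sentence is illustrative):

# Minimal sketch: POS tagging with NLTK's averaged perceptron tagger.
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # one-time downloads

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]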
13. Language Modeling:
Based on the history of previous words, language modeling predicts which words are likely to come next. A good example of this is the sentence auto-completion feature in Gmail.
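A toy sketch of a bigram language model, assuming a tiny made-up corpus:

# Minimal sketch: a toy bigram language model that predicts the next word
# from the single previous word.
from collections import Counter, defaultdict

corpus = "i love nlp . i love python . nlp is fun".split()

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent word seen after `word` in the corpus.
    return following[word].most_common(1)[0][0] if following[word] else None

print(predict_next("i"))     # 'love'
print(predict_next("love"))  # 'nlp' (first of the equally frequent candidates)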
14. Topic Modelling:
This helps uncover the topical structure of a large collection of documents. This indicates what topic a piece of text is actually about.
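A minimal sketch using latent Dirichlet allocation (LDA) from scikit-learn, one common topic-modelling technique; the corpus and topic count are made up:

# Minimal sketch: topic modelling with LDA on a toy corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market fell as investors sold shares",
    "the team won the match with a late goal",
    "shares of the bank rose after strong earnings",
    "the striker scored twice in the final match",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of components_ scores the vocabulary for one topic.
vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [vocab[j] for j in topic.argsort()[-4:]]
    print(f"topic {i}:", top_words)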
15. Information Retrieval:
This helps in fetching relevant documents based on a user’s search query.
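A minimal sketch that ranks made-up documents against a query using TF-IDF vectors and cosine similarity from scikit-learn:

# Minimal sketch: a tiny retrieval system over a toy document collection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how to train a neural network",
    "recipes for italian pasta dishes",
    "an introduction to natural language processing",
]
query = "neural network training"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

# Score every document against the query and return the best match.
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(docs[scores.argmax()])  # 'how to train a neural network'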
16. Information Extraction:
This is the task of extracting relevant pieces of information from a given text, such as calendar events from emails.
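A minimal sketch that pulls date mentions out of an email body with a regular expression; the pattern and text are illustrative only:

# Minimal sketch: extracting simple date mentions from an email body.
import re

email = "Hi team, the review is on 12/05/2024 and the demo on 03/06/2024."
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", email)
print(dates)  # ['12/05/2024', '03/06/2024']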
17. What is the difference between normalization and standardization?
Normalization maps all values of a feature into the range [0, 1] using the transformation:
x_norm = (x − x_min) / (x_max − x_min)
Suppose a particular input feature x has values in the range [x_min, x_max]. When x equals x_min, x_norm is 0, and when x equals x_max, x_norm is 1. So for all values of x between x_min and x_max, x_norm maps to a value between 0 and 1.
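A minimal sketch of this min-max transform with NumPy (the values are made up):

# Minimal sketch: min-max normalization of a feature with NumPy.
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# x_norm = (x - x_min) / (x_max - x_min), so x_min -> 0 and x_max -> 1.
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.    0.25  0.625 1.   ]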
Standardization, on the other hand, transforms the input values such that they follow a normal distribution with zero mean and unit variance (unit Gaussian). Mathematically, the transformation on the data points in a distribution with mean μ and standard deviation σ is given by:
x_std = (x − μ) / σ
In practice, this process of standardization is also referred to as normalization (not to be confused with the min-max normalization discussed above). As part of the preprocessing step, you can add a layer that applies this transform to the input features so that they all have a similar distribution; in Keras, this is done with a normalization layer.
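A minimal sketch, assuming TensorFlow 2.x, of Keras' Normalization layer (the data is made up):

# Minimal sketch: standardizing input features with Keras' Normalization
# layer, which learns the mean and variance from the data via adapt().
import numpy as np
import tensorflow as tf

data = np.array([[10.0], [20.0], [35.0], [50.0]], dtype="float32")

layer = tf.keras.layers.Normalization(axis=-1)
layer.adapt(data)  # computes the mean and variance of each feature

# Output has (approximately) zero mean and unit variance.
print(layer(data).numpy().round(2))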