Similarity Detector (ORSI)

GitHub: https://github.com/OpenReqEU/similarity-detection

The OpenReq Similarity detector (ORSI) component is a service that detects similarity between requirements. Similarity detection, also known as paraphrase detection, is closely related to the detection of interdependencies between requirements. The relationship is most evident when two requirements have almost exactly the same formulation, because in that case there is an OR interdependency between them. Consider the requirements "The interface should use the letter type Arial" and "The interface should use the letter type Calibri". These two requirements are clearly similar (they differ only in the words Arial and Calibri), and they cannot both be used in the same system, since it is not possible to use two letter types for the whole system interface. The requirements are therefore related by an OR interdependency.

Using the name and description of each requirement, ORSI computes a similarity score between requirements by combining Term Frequency-Inverse Document Frequency (TF-IDF) with the cosine metric. Specifically, ORSI receives an initial set of requirements and passes them through a Natural Language Processing pipeline consisting of tokenization, stemming and stop-word removal. For each resulting token, it counts the number of occurrences in the whole set of requirements and calculates the inverse frequency. Next, ORSI generates a vector for each input requirement with a size equal to its number of tokens. Each value in the vector is the frequency of the token in the requirement multiplied by the inverse frequency of that token in the whole set of requirements, as calculated previously. To obtain the similarity between two requirements, ORSI computes the cosine of the angle between their respective vectors. Since the vectors are numerical representations of the requirements that preserve part of their meaning, the angle between them is a good measure of their similarity.
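
The pipeline described above can be illustrated with a minimal Python sketch. It is not the ORSI implementation: the function names (tokenize, tfidf_vectors, cosine), the tiny stop-word list and the smoothed inverse-frequency formula log(N / df) + 1 are all assumptions made for illustration, and stemming is left out to keep the example dependency-free. Vectors are stored sparsely as token-to-weight mappings.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (assumption, not ORSI's list).
STOP_WORDS = {"the", "a", "an", "should", "use", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, split on non-letters and drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(requirements):
    """Return one sparse TF-IDF vector (token -> weight) per requirement."""
    docs = [Counter(tokenize(r)) for r in requirements]
    n = len(docs)
    # Document frequency: in how many requirements each token appears.
    df = Counter(token for doc in docs for token in doc)
    # Assumed smoothed variant of the inverse frequency: log(N / df) + 1.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


requirements = [
    "The interface should use the letter type Arial",
    "The interface should use the letter type Calibri",
    "The system must export reports as PDF",
]
vectors = tfidf_vectors(requirements)
print(cosine(vectors[0], vectors[1]))  # the two letter-type requirements
print(cosine(vectors[0], vectors[2]))  # no shared tokens after stop-word removal
```

In this sketch the two letter-type requirements share three of their four tokens and so receive a clearly higher score than the unrelated pair, which shares none. A threshold on the score, chosen by the user, would then decide which pairs are reported as similar.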