Abstract
The common law system relies on precedent, that is, previous court decisions, to resolve current cases. As more legal documents have become available in digital form, the sheer volume of data has made it increasingly difficult for legal professionals to identify relevant past cases manually. To address this issue, researchers have developed automated systems that measure the similarity between legal documents. Our research explores several representations of a legal document and proposes a novel paragraph filtering process that uses legal citation information to identify key paragraphs and remove unnecessary ones without changing the substance of the document. We analyze the performance of the proposed approach on document comparison using TF-IDF, BERT, Legal-BERT, Doc2Vec, and Legal-Longformer. The results show that a model trained on the filtered paragraphs can achieve better results than a model trained on the complete text, while the filtering shortens the document by over 40%. The proposed filtering strategy could be especially helpful for models such as BERT, where the maximum token length is fixed.
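To make the idea concrete, the following is a minimal sketch of citation-based paragraph filtering followed by a TF-IDF comparison. It is not the method evaluated in this work: the citation pattern `CITATION_RE`, the rule "keep a paragraph if it contains a citation", the blank-line paragraph splitting, and the helper names `filter_paragraphs`, `tfidf_vectors`, and `cosine` are all assumptions chosen for illustration, and the sample citation is made up.

```python
import math
import re
from collections import Counter

# Hypothetical citation pattern (an assumption, not taken from the paper):
# a bracketed or parenthesised year followed by a reporter-style reference,
# such as "[2001] 1 XYZ 100".
CITATION_RE = re.compile(r"[\[(]\d{4}[\])]\s+\d*\s*[A-Z][A-Za-z.]+\s+\d+")


def filter_paragraphs(text):
    """Keep only paragraphs that contain a citation (illustrative rule)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    kept = [p for p in paragraphs if CITATION_RE.search(p)]
    # Fall back to the full text when no paragraph carries a citation.
    return kept or paragraphs


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) with a smoothed IDF."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Made-up documents; the citation string is a placeholder.
    doc_a = ("The appellant filed a suit for damages.\n\n"
             "The duty of care test stated in [2001] 1 XYZ 100 applies here.")
    doc_b = ("Costs were awarded to the respondent.\n\n"
             "The court followed the duty of care test in [2001] 1 XYZ 100.")
    filtered = [" ".join(filter_paragraphs(d)) for d in (doc_a, doc_b)]
    vec_a, vec_b = tfidf_vectors(filtered)
    print(cosine(vec_a, vec_b))
```

In this sketch the comparison runs on the filtered text only, so paragraphs without citations do not contribute to the similarity score. A real pipeline would need a citation extractor suited to the jurisdiction and a more careful criterion for which paragraphs count as key.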