US20240205520
2024-06-20
Electricity
H04N21/8549
A method for summarizing long videos featuring spoken content is presented, leveraging speech recognition to convert video files into transcripts. These transcripts are then segmented into multiple paragraphs using a transformer-based model. Each paragraph is assigned an importance score based on its relevance to the overall content of the transcript, enabling effective ranking and selection of significant information.
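The scoring step above can be sketched in miniature. The patent computes relevance with pretrained transformer embeddings; as a self-contained stand-in, this hypothetical sketch scores each paragraph by its bag-of-words cosine similarity to the full transcript, so more representative paragraphs rank higher. All function names and the sample paragraphs are illustrative assumptions, not the claimed implementation.

```python
from collections import Counter
from math import sqrt

def bow(text):
    # Bag-of-words vector (word -> count), lowercased.
    # The patent would use transformer embeddings here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def importance_scores(paragraphs):
    # Score each paragraph by similarity to the whole transcript.
    whole = bow(" ".join(paragraphs))
    return [cosine(bow(p), whole) for p in paragraphs]

# Hypothetical segmented transcript.
paragraphs = [
    "the model segments the transcript into paragraphs",
    "each paragraph receives an importance score",
    "unrelated chatter about the weather today",
]
scores = importance_scores(paragraphs)
ranked = sorted(range(len(paragraphs)), key=lambda i: -scores[i])
```

The ranked indices then feed the candidate-summary stage described next.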
The method involves creating candidate summaries from the ranked paragraphs, each containing two or more paragraphs. Coherence reranking is applied to these candidate summaries, evaluating both coherence and relevance. This process ensures that the final selected summary is not only important but also coherent and understandable for viewers.
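A minimal sketch of candidate generation and coherence reranking, under assumed details: candidates are pairs drawn from the top-ranked paragraphs, relevance is a per-paragraph score, and coherence is approximated by a toy proxy that prefers paragraphs adjacent in the original transcript (the patent's actual coherence model is not specified here). All names and numbers are illustrative.

```python
from itertools import combinations

def candidate_summaries(ranked_ids, k=2, top_n=4):
    # All k-paragraph candidates drawn from the top-n ranked
    # paragraphs, kept in original transcript order.
    pool = ranked_ids[:top_n]
    return [tuple(sorted(c)) for c in combinations(pool, k)]

def coherence(candidate):
    # Toy coherence proxy: smaller gaps between chosen paragraphs
    # in the original transcript read more smoothly.
    gaps = [b - a for a, b in zip(candidate, candidate[1:])]
    return 1.0 / (1.0 + sum(g - 1 for g in gaps))

def rerank(candidates, relevance):
    # Combine relevance and coherence; best candidate first.
    def score(c):
        return sum(relevance[i] for i in c) * coherence(c)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical relevance scores and ranking for four paragraphs.
relevance = {0: 0.9, 1: 0.7, 2: 0.6, 3: 0.4}
cands = candidate_summaries([0, 2, 1, 3], k=2)
best = rerank(cands, relevance)[0]
# best: paragraphs 0 and 1 (relevant and adjacent)
```

Here the pair (0, 1) wins over (0, 2) despite the latter's higher combined relevance ordering in the input, because adjacency boosts its coherence term.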
Key innovations include the use of unsupervised learning techniques for extractive summarization, which addresses limitations of existing systems that often rely on labeled data. The method employs pretrained transformer models to compute similarities between paragraphs, utilizing a PageRank algorithm for effective relevance ranking. Additionally, an adapter-based approach is used to segment spoken text into coherent paragraphs.
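The PageRank step over paragraph similarities can be sketched with a plain power iteration, assuming the similarities have already been computed by a transformer model. The similarity matrix below is hypothetical; the function is a generic PageRank, not the patent's exact formulation.

```python
def pagerank(sim, damping=0.85, iters=50):
    # Power-iteration PageRank over a paragraph-similarity matrix.
    # sim[i][j] is the similarity between paragraphs i and j.
    n = len(sim)
    # Row-normalize similarities into transition probabilities.
    trans = []
    for row in sim:
        s = sum(row)
        trans.append([v / s for v in row] if s else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[j] * trans[j][i] for j in range(n))
            for i in range(n)
        ]
    return rank

# Hypothetical similarity matrix for three paragraphs; paragraphs
# 0 and 1 are strongly related, paragraph 2 is an outlier.
sim = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
ranks = pagerank(sim)
```

The outlier paragraph ends up with the lowest rank, which is exactly the behavior the relevance-ranking stage relies on.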
Once a summary is selected, segments of the original video corresponding to the chosen paragraphs are identified using timestamps, allowing for the creation of a summary video or audio. The method also supports generating shortened summaries or trailer videos from selected sentences, providing flexibility in how summarized content can be utilized.
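The timestamp-mapping step might look like the following sketch: each selected paragraph carries (start, end) times from the speech recognizer, and nearby segments are merged so the summary video has no tiny cuts. The merge threshold, function name, and sample timings are assumptions for illustration.

```python
def summary_segments(selected, timings, gap=1.0):
    # Map selected paragraph ids to (start, end) times in the source
    # video, merging segments separated by less than `gap` seconds.
    spans = sorted(timings[i] for i in selected)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

# Hypothetical per-paragraph timestamps (seconds) from the ASR output.
timings = {0: (0.0, 12.5), 1: (13.0, 30.0), 2: (95.0, 120.0)}
segments = summary_segments([0, 1, 2], timings)
# Paragraphs 0 and 1 merge (0.5 s apart); paragraph 2 stays separate.
```

The resulting segments can then be cut from the original video to produce the summary video, or from selected sentences only to produce a shorter trailer.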
This summarization system offers significant improvements over traditional methods by producing high-quality, coherent summaries without needing human feedback or ground truth comparisons. It enables users to access multiple variations of summaries tailored to different audience needs while optimizing computational efficiency by avoiding redundant processing of speech recognition models.