Invention Title:

METHOD FOR COHERENT, UNSUPERVISED, TRANSCRIPT-BASED, EXTRACTIVE SUMMARISATION OF LONG VIDEOS OF SPOKEN CONTENT

Publication number:

US20240205520

Section:

Electricity

Class:

H04N21/8549

Smart overview of the Invention

The method summarises long videos of spoken content. Speech recognition converts the video file into a transcript, which a transformer-based model then segments into paragraphs. Each paragraph is assigned an importance score reflecting its relevance to the transcript as a whole, and the paragraphs are ranked by that score.

Relevance Ranking and Candidate Summaries

Candidate summaries are built from the ranked paragraphs, each candidate containing two or more paragraphs. The candidates are then reranked on both coherence and relevance, so that the summary finally selected is relevant to the source and also reads coherently for viewers.
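
The candidate generation and reranking step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `top_k`, `min_size`, `max_size` and `alpha` parameters, and the weighted-sum scoring are all assumptions, and `toy_coherence` is a hypothetical stand-in for whatever coherence model is actually used.

```python
from itertools import combinations


def candidate_summaries(ranked_ids, top_k=4, min_size=2, max_size=3):
    """Build candidates of min_size..max_size paragraphs from the top_k ranked ones.

    ranked_ids holds paragraph indices (positions in the transcript),
    sorted by importance, most important first.
    """
    pool = ranked_ids[:top_k]
    candidates = []
    for size in range(min_size, max_size + 1):
        for combo in combinations(pool, size):
            # Keep paragraphs in transcript order inside each candidate.
            candidates.append(tuple(sorted(combo)))
    return candidates


def rerank(candidates, relevance, coherence_fn, alpha=0.5):
    """Sort candidates by a weighted mix of mean relevance and coherence."""
    def score(cand):
        mean_relevance = sum(relevance[i] for i in cand) / len(cand)
        return alpha * mean_relevance + (1 - alpha) * coherence_fn(cand)
    return sorted(candidates, key=score, reverse=True)


def toy_coherence(cand):
    """Hypothetical coherence proxy: share of neighbouring pairs that are
    also adjacent in the transcript. A real system would use a model here."""
    pairs = list(zip(cand, cand[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if b - a == 1) / len(pairs)


# Illustrative importance scores for four paragraphs (made-up values).
relevance = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}
ranked = sorted(relevance, key=relevance.get, reverse=True)
candidates = candidate_summaries(ranked, top_k=3)
best = rerank(candidates, relevance, toy_coherence)[0]
```

Restricting candidates to the top-ranked paragraphs keeps the number of combinations small, and the `alpha` weight makes the trade-off between relevance and coherence explicit.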

Technical Details and Innovations

Key innovations include the use of unsupervised learning techniques for extractive summarisation, which addresses a limitation of existing systems that often rely on labelled data. The method uses pretrained transformer models to compute similarities between paragraphs and applies a PageRank algorithm to those similarities for relevance ranking. In addition, an adapter-based approach is used to segment spoken text into coherent paragraphs.
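
The relevance-ranking idea can be illustrated with a small sketch: paragraphs are nodes, similarities between paragraph embeddings are edge weights, and PageRank yields an importance score per paragraph. The code below is an assumption-laden example, not the method as claimed: cosine similarity, the damping factor, the iteration count and the function names are illustrative choices, and the hand-written vectors stand in for embeddings that a pretrained transformer would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pagerank_scores(embeddings, damping=0.85, iterations=50):
    """Weighted PageRank over a paragraph similarity graph (power iteration)."""
    n = len(embeddings)
    # Edge weights: non-negative cosine similarity, no self-loops.
    weights = [
        [max(cosine(embeddings[i], embeddings[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_sums[i] > 0:
                    incoming += scores[i] * weights[i][j] / out_sums[i]
                else:
                    # Node with no outgoing weight: spread its score uniformly.
                    incoming += scores[i] / n
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


# Toy embeddings: three similar paragraphs and one off-topic paragraph.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
scores = pagerank_scores(embeddings)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy input the off-topic paragraph is only weakly connected to the rest, so it ends up at the bottom of the ranking.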

Summary Video Generation

Once a summary is selected, the segments of the original video that correspond to the chosen paragraphs are identified from their timestamps and used to create a summary video or summary audio. The method also supports generating shortened summaries or trailer videos from selected sentences, giving flexibility in how the summarised content is used.
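
Mapping the selected paragraphs back to the video can be sketched as below. The example assumes each paragraph carries a `(start, end)` timestamp pair in seconds taken from the transcript. The `summary_segments` name and the `gap` threshold for joining nearly contiguous segments are hypothetical, and the actual cutting and concatenation of the media is left to whatever video tool is used.

```python
def summary_segments(paragraph_spans, selected, gap=0.5):
    """Return the time ranges to keep, in playback order.

    paragraph_spans: list of (start, end) seconds, one pair per paragraph.
    selected: indices of the paragraphs chosen for the summary.
    gap: segments closer together than this many seconds are merged.
    """
    spans = sorted(paragraph_spans[i] for i in selected)
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= gap:
            # Contiguous or nearly contiguous: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_duration(segments):
    """Length of the resulting summary in seconds."""
    return sum(end - start for start, end in segments)


# Illustrative timestamps for four paragraphs (made-up values).
spans = [(0.0, 12.5), (12.5, 30.0), (30.2, 41.0), (41.0, 60.0)]
segments = summary_segments(spans, selected=[0, 3])
```

Merging neighbouring segments avoids needless cuts when consecutive paragraphs are both selected.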

Advantages and Applications

Compared with traditional methods, this summarisation system produces high-quality, coherent summaries without needing human feedback or ground-truth comparisons. It lets users obtain multiple variations of a summary tailored to different audience needs, and it reduces computation by avoiding redundant runs of the speech recognition model.