| CPC H04N 19/167 (2014.11) [H04N 19/172 (2014.11)] | 20 Claims |

|
1. A system comprising:
at least one processor; and
at least one non-transitory computer-readable storage medium having computer-executable instructions stored thereon which, when executed on the at least one processor, cause the system to perform operations comprising:
receiving, from a third-party device associated with a third-party, a request;
identifying, via the request, a first set of boundaries associated with first locations of first frames in media content;
identifying a second set of boundaries and a third set of boundaries associated with the media content, the second set of boundaries being associated with second locations of second frames in the media content and being generated utilizing a computer vision/machine learning (CV/ML) device, the third set of boundaries being associated with third locations of third frames in the media content, the third locations being default locations generated utilizing an encoder algorithm for placing instantaneous decoder refresh (IDR) frames;
merging a combination of boundaries, including the first set of boundaries, the second set of boundaries, and the third set of boundaries, to generate a target set of boundaries, the target set of boundaries being associated with target locations of target frames in the media content;
performing, based on a target boundary report, a CV/ML boundary report, and a default boundary report, an encoding process with the target set of boundaries to encode video content and audio content of the media content as encoded media content, the target boundary report including the target set of boundaries and having a higher level of accuracy than at least one of the CV/ML boundary report or the default boundary report, the CV/ML boundary report including the second set of boundaries, the default boundary report including the third set of boundaries; and
packaging the encoded media content as packaged media content, with segments of the audio content being aligned with the target set of boundaries, such that the video content and the audio content are synchronized during playback of the encoded media content.
|
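
The merging and packaging operations recited in claim 1 can be illustrated with a minimal Python sketch, which is not the claimed implementation. The claim does not say how the three boundary sets are combined, so the priority order (request first, then CV/ML, then encoder defaults), the `min_gap` parameter, and both function names are hypothetical. The audio step further assumes fixed-size audio frames of 1024 samples at 48 kHz and simply snaps each target boundary to the nearest audio frame.

```python
from fractions import Fraction


def merge_boundaries(requested, cvml, default, min_gap=1):
    """Merge three lists of boundary frame indices into one target list.

    Assumption (not from the claim): when two boundaries fall closer
    together than `min_gap` frames, the one from the higher-priority
    source is kept (requested > CV/ML > default).
    """
    priority_of = {}
    for priority, frames in enumerate((requested, cvml, default)):
        for frame in frames:
            # setdefault keeps the first (highest) priority for duplicates.
            priority_of.setdefault(frame, priority)

    merged = []
    for frame in sorted(priority_of):
        if merged and frame - merged[-1] < min_gap:
            # Too close to the previous boundary: keep the better source.
            if priority_of[frame] < priority_of[merged[-1]]:
                merged[-1] = frame
            continue
        merged.append(frame)
    return merged


def audio_segment_starts(boundaries, fps, sample_rate=48000,
                         samples_per_frame=1024):
    """Snap each video boundary to the nearest audio frame index.

    Assumption: audio is coded in fixed-size frames, so a segment can
    only start on a multiple of `samples_per_frame` samples.
    """
    starts = []
    for frame in boundaries:
        seconds = Fraction(frame) / Fraction(fps)
        starts.append(round(seconds * sample_rate / samples_per_frame))
    return starts


# Hypothetical inputs: frame indices at 24 frames per second.
requested = [0, 240]                     # first set, from the request
cvml = [118, 241, 480]                   # second set, from CV/ML analysis
default = [0, 120, 240, 360, 480]        # third set, encoder IDR defaults

target = merge_boundaries(requested, cvml, default, min_gap=5)
print(target)  # [0, 118, 240, 360, 480]
print(audio_segment_starts(target, fps=24))
```

In this sketch the default boundary at frame 120 is dropped in favor of the CV/ML boundary at 118, and the CV/ML boundary at 241 is dropped in favor of the requested boundary at 240. A real packager would also have to account for encoder delay and priming samples, which are omitted here.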