
JavaScript Cleaning .srt with JavaScript .replace

Richie C

New Coder
Hi,

I have an HTML page where users upload an .srt (subtitle) file, and I want to clean it up by removing the unnecessary text.

Example uploaded .srt file:
9
00:00:31,690 --> 00:00:35,550
have taken years to complete. Due to this, no

10
00:00:35,550 --> 00:00:38,100
single donkey could have possibly completed the

11
00:00:38,100 --> 00:00:41,790
journey, so this resulted in unplanned matings,

The result that I want is:
have taken years to complete. Due to this, no single donkey could have possibly completed the journey, so this resulted in unplanned matings,

The code I use at the moment is:
JavaScript:
reader.onload = (e) => {
    //console.log(files[i].name, e.target.result);
    var fileName = files[i].name;
    var text = e.target.result;
    text = text.replace(/WEBVTT[\r\n]/,"");
    text = text.replace(/NOTE duration:.*[\r\n]/,"");
    text = text.replace(/NOTE language:.*[\r\n]/,"");
    text = text.replace(/NOTE Confidence:.+\d/g,"");
    text = text.replace(/NOTE recognizability.+\d/g,"");
    text = text.replace(/[\r\n].+-.+-.+-.+-.+/g,"");
    text = text.replace(/[\r\n].+ --> .+[\r\n]/g,"");
    text = text.replace(/.[\r\n]. --> .+[\r\n]/g,"");
    text = text.replace(/[\n](.)/g," $1");
    text = text.replace(/[\r\n]+/g,"");
    text = text.replace(/^ /,"");
    var heading = document.createElement('h3');
    document.body.appendChild(heading);
    heading.innerHTML = "Transcript for '" + files[i].name + "'";

    var copyButton = document.createElement('button');
    document.body.appendChild(copyButton);
    copyButton.onclick = function() {copyToClip(text,fileName); };
    copyButton.innerHTML = "Copy transcript";
    copyButton.className = "copyButton";

    var div = document.createElement('div');
    document.body.appendChild(div);
    div.className = "cleanVTTText";
    div.innerHTML = text;
};


But this leaves the text like this:
9 have taken years to complete. Due to this, no 10 single donkey could have possibly completed the 11 journey, so this resulted in unplanned matings,


What's the best way to remove the ascending numbers? (They can be from 1 to 3 digits long.)
Thanks
 

cbreemer

King Coder
Thanks both for your help, that seems to be working for me but I will need to keep an eye out if there are digits in the text!
Yes. Actually, for that reason, this is not a good solution at all. What you need to do is, early on, skip the lines that contain nothing but a sequence of digits. If you replace this

JavaScript:
 var text = e.target.result;
by this

JavaScript:
var text = "\n" + e.target.result;
text = text.replace(/[\r\n][0-9]+[\r\n]/g, "\n");

it should work correctly in all cases.

Personally, I'd have read the file line by line, filtered out all the lines I don't want, and only then strung it all together. To my simple brain, all this regex is just complicating the issue. But I have to admit I don't like regex a lot, and I would just as soon write some code as work out these devilish regex expressions.
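For anyone who wants to try the line-by-line route, here is a minimal sketch of what it could look like. The function name srtToText is made up for illustration and is not part of the original page's code. It assumes every unwanted line is either blank, a digits-only counter, or a timing line containing "-->". Note that a subtitle line consisting only of digits would be dropped as well.

```javascript
// Hypothetical helper, not from the original page: reduce an .srt string
// to its spoken text by filtering lines instead of chaining regex replaces.
function srtToText(srt) {
  return srt
    .split(/\r?\n/)                          // works for LF and CRLF files
    .map((line) => line.trim())
    .filter((line) => line !== "")           // blank separators between cues
    .filter((line) => !/^\d+$/.test(line))   // cue counters (digits only)
    .filter((line) => !line.includes("-->")) // timing lines
    .join(" ");                              // string it all together
}

// Sample taken from the first post.
const sample =
  "9\n00:00:31,690 --> 00:00:35,550\n" +
  "have taken years to complete. Due to this, no\n\n" +
  "10\n00:00:35,550 --> 00:00:38,100\n" +
  "single donkey could have possibly completed the\n\n" +
  "11\n00:00:38,100 --> 00:00:41,790\n" +
  "journey, so this resulted in unplanned matings,\n";

console.log(srtToText(sample));
```

Inside the reader.onload handler this would presumably replace the whole chain of text.replace calls with a single call such as text = srtToText(e.target.result), but that wiring is an assumption about the rest of the page.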
 

Richie C

New Coder
My knowledge of this coding is pretty minimal, I 'borrowed' the code from a Microsoft Subtitle cleaner and tried to adapt it for my purposes.

I've added this line of code:
JavaScript:
text = text.replace(/\n?\d*?\n?^.* --> [012345]{2}:.*$/mg ,"");

This seems to do everything. It removes the single- or double-digit counters but doesn't remove numbers from within the text. Can you see any issues with using this?
 

cbreemer

King Coder
Hell no.... what could possibly go wrong ? 🤣
Joking apart, this kind of cryptic regex stuff is not my thing at all. I can't even be bothered to try and understand your construction. But if it works, it works.
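One thing that may be worth checking is line endings. In JavaScript, \n in a pattern does not match the \r of a Windows-style \r\n line break, so the counter line may survive when the uploaded file uses CRLF. The sketch below runs the pattern from the post above on a made-up two-cue sample in both forms. The sample text and variable names are illustrative only.

```javascript
// The pattern from the post above, unchanged.
const cuePattern = /\n?\d*?\n?^.* --> [012345]{2}:.*$/mg;

// Made-up two-cue sample, once with LF and once with CRLF line endings.
const lf =
  "9\n00:00:31,690 --> 00:00:35,550\nhave taken\n\n" +
  "10\n00:00:35,550 --> 00:00:38,100\nsingle donkey\n";
const crlf = lf.replace(/\n/g, "\r\n");

const lfResult = lf.replace(cuePattern, "");
const crlfResult = crlf.replace(cuePattern, "");

// Possible workaround: normalise line endings before cleaning.
const normalised = crlf.replace(/\r\n?/g, "\n").replace(cuePattern, "");

// Does any digits-only line remain after cleaning?
const hasCounterLine = (s) => /^\d+$/m.test(s);
console.log("LF input keeps counters:", hasCounterLine(lfResult));
console.log("CRLF input keeps counters:", hasCounterLine(crlfResult));
console.log("Normalised CRLF keeps counters:", hasCounterLine(normalised));
```

If the files always come from the same tool and use plain \n, this will not matter. Otherwise, normalising the line endings once at the top of the handler is probably the simplest guard.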
 