Voice synthesis on ISR

By Grauw

Ascended (10558)

07-12-2021, 00:33

ARTRAG wrote:

Assume you have two successive waves:
w0 = ____/‾‾‾‾\ and w1 = _/‾\__/‾\_
to be played during frame 0 and frame 1 respectively.
Each frame lasts 1/50 sec.

But that situation is relatively uncommon, isn’t it? Human voices don’t change so abruptly. If I scroll through the “Power Up” sample from your repository, I only see similar waveforms with smooth progressions between them.

Maybe we should talk about more concrete examples, because I think we’re not having the same situations in mind. What I described indeed does not do anything for your case, but it does address a different issue.

The thing I have in mind with aligning the waves is seen for example in the “Yeah” in GhostwriterP’s Awesome song; the waveform moves like this:

[animated waveform visualisation]

Here you can see that the frames all have similar pairs of up and down bumps, but they continuously shift around rather than staying stationary in place relative to each other (like standing waves). A byte that is a peak in one frame will be a valley in the next. Whereas if you just look at the amplitudes, it changes reasonably gradually.

I think this is the reason the Yeah sample doesn’t sound very clean.

By Grauw

Ascended (10558)

07-12-2021, 00:39

ARTRAG wrote:

Assume you have two successive waves:
w0 = ____/‾‾‾‾\ and w1 = _/‾\__/‾\_
to be played during frame 0 and frame 1 respectively.
Each frame lasts 1/50 sec.

To go into that example a bit;

I think the solution for that does not lie in rotating the waveform. It’s too impractical timing-wise: the ISR needs to execute with a spacing of exactly 1/50.16th or 1/59.92th of a second, and this is difficult to achieve in practice. That’s not even considering that the VDP and CPU sometimes run on different clocks.

So, discarding that option, let’s look at the following four situations where this can occur:

  1. An abrupt octave shift, but the harmonics are similar. The ideal algorithm should change the base frequency rather than doubling up the waveform.
  2. An abrupt change in the harmonics, but the base frequency is similar. I think no distortion will be perceived because the tonal character already changes abruptly.
  3. An abrupt change in both the base frequency and harmonics. I think no distortion will be perceived because the tonal character and frequency already change abruptly.
  4. A misdetection of the fundamental frequency, where the previous frame’s base frequency was an octave higher than this doubled-up frame’s. The ideal algorithm should be resistant to abrupt frequency changes when the harmonics are similar (see the sketch below).
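
As a rough illustration of that resistance, here’s a hypothetical heuristic of my own (not part of any existing converter): compare the candidate pitch against the previous frame’s, and reject near-octave jumps when the harmonic content barely changed.

function stabilisePitch(prevF0, candF0, harmonicSimilarity)
{
    // harmonicSimilarity: a 0–1 score of how alike the two frames’ magnitude
    // spectra are (e.g. a normalised correlation).
    const ratio = candF0 / prevF0;
    const nearOctave = Math.abs(ratio - 2) < 0.1 || Math.abs(ratio - 0.5) < 0.025;
    // A near-octave jump with nearly unchanged harmonics is more likely a
    // misdetection than a real pitch change, so keep the previous fundamental.
    return nearOctave && harmonicSimilarity > 0.9 ? prevF0 : candF0;
}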

I fully understand that “ideal” pitch tracking is difficult to achieve, but I do think that improvements in that area are where the solution lies for this particular example; in the other two cases there would be no need for a solution.

By chance have you been looking at papers in the area of pitch detection? I’ve been reading up on it a bit this week and I would be interested if there are some in particular that are good reads. I could also post relevant ones that I come across…

By GhostwriterP

Hero (653)

06-12-2021, 21:09

Lining up the waveforms could be a post-encoding operation: just check all 32 rolled positions of the next waveform against the previous one and take the one with the least squared error (or something similar), no? I never got around to testing it myself yet, but I think it could improve the quality a bit.
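
In JavaScript the search could look something like this (an untested sketch, assuming each frame is an array of 32 sample values):

function bestRotation(prev, next)
{
    let bestShift = 0;
    let bestError = Infinity;
    for (let shift = 0; shift < 32; shift++)
    {
        let error = 0;
        for (let i = 0; i < 32; i++)
        {
            // Squared error between the previous wave and the rotated next wave.
            const diff = next[(i + shift) % 32] - prev[i];
            error += diff * diff;
        }
        if (error < bestError)
        {
            bestError = error;
            bestShift = shift;
        }
    }
    // Return the rotation of next that lines up best with prev.
    return next.map((_, i) => next[(i + bestShift) % 32]);
}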

By Grauw

Ascended (10558)

06-12-2021, 21:29

Yes, it would be the final step in the algorithm. Though ideally it would operate on the DFT of the original waveform before quantising to a 32-byte waveform for the SCC, for optimal resolution.

I’m not sure what the best solution would be in the time domain (if there is one); that’s why I described it in terms of the phase of the fundamental frequency. This does not rely on a delta to the previous frame. But anyway, once you know that phase offset, in the time domain you can rotate the waveform samples, albeit limited to a phase resolution of 2π/32.

But minimising the least squared error would probably do a decent job lining up the waves. It’s certainly much easier to code up given the output of ARTRAG’s conversion tool. It’d be interesting to hear the difference, if only as confirmation, before attempting a more sophisticated approach.
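
To illustrate the time-domain rotation I mentioned, here’s a minimal sketch (hypothetical code; wave being the 32-byte SCC waveform and phase the fundamental’s phase in radians):

function rotateWave(wave, phase)
{
    // Convert the phase to a whole-sample shift (the sign depends on the DFT
    // convention); the resolution is thus limited to 2π/32.
    const shift = ((Math.round(phase / (2 * Math.PI) * 32) % 32) + 32) % 32;
    return wave.map((_, i) => wave[(i + shift) % 32]);
}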

By ARTRAG

Enlighted (6828)

07-12-2021, 12:52

Grauw wrote:

The thing I have in mind with aligning the waves is seen for example in the “Yeah” in GhostwriterP’s Awesome song [...] the frames all have similar pairs of up and down bumps, but they continuously shift around rather than staying stationary in place relative to each other (like standing waves). A byte that is a peak in one frame will be a valley in the next.

Let me restate your ideas in maths terms.

If we rotate w1 to the phase that minimises the error with w0 (e.g. the mean square error), then when we pass from w0 to w1, wherever the sample timer is (in 0–31), the next sample in w1 should be more or less similar to the next sample in w0.

This will not guarantee a continuous wave (unless w0 and w1 are essentially equal), but it could help to reduce the volume jumps at the wave change.

I understand that if w0 and w1 are very different the result will be about the same as we get now, but in a certain number of cases it could give an improvement. Is this what I should try?

By Grauw

Ascended (10558)

07-12-2021, 16:01

Yes, that’s the idea. As far as I can see each frame’s wave is currently rotated randomly; I think aligning them will give a noticeable improvement to the quality.

By ARTRAG

Enlighted (6828)

07-12-2021, 16:53

This weekend I will do some experiments; if there is some improvement, I could also easily optimise the phase rotation in the DFT domain to gain sub-sample resolution, e.g. 2π/256 or finer.
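
Something like this untested sketch (assuming prev and curr are arrays of {re, im} DFT bins for two successive waves): try 256 evenly spaced phases θ, rotate bin n of the current wave by e^(−i·n·θ), and keep the θ with the smallest squared distance to the previous wave’s spectrum. By Parseval’s theorem this matches minimising the time-domain MSE.

function bestSubSamplePhase(prev, curr)
{
    let bestTheta = 0;
    let bestError = Infinity;
    for (let k = 0; k < 256; k++)
    {
        const theta = 2 * Math.PI * k / 256;
        let error = 0;
        for (let n = 0; n < prev.length; n++)
        {
            // curr[n] rotated by e^(−i·n·θ).
            const c = Math.cos(n * theta);
            const s = Math.sin(n * theta);
            const re = curr[n].re * c + curr[n].im * s;
            const im = curr[n].im * c - curr[n].re * s;
            error += (re - prev[n].re) ** 2 + (im - prev[n].im) ** 2;
        }
        if (error < bestError)
        {
            bestError = error;
            bestTheta = theta;
        }
    }
    return bestTheta;
}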

By Grauw

Ascended (10558)

07-12-2021, 23:25

I made a test, using an A4 sine, since my algorithm’s pitch detection is… not great yet.

Sine without normalizing the phase

This represents the output without regard to phase. In this particular case, the 60 Hz changes to the waveform introduce various overtones. For speech the effect is a bit less pronounced.

Sine with normalized phase

In this one the phase is normalized. A little distortion is still audible; this is the unavoidable SCC noise from setting the frequency register, which resets the period counter. The normalisation method is as I described it before:

// Take the fundamental’s phase (the imaginary part of the log-polar form),
// offset by ½π to centre the waveform.
const phase = Math.PI / 2 + fs[1].polar().i;
for (let n = 1; n < 17; n++)
{
    // Rotate harmonic n by n times that phase: in log-polar form a rotation
    // is an addition to the imaginary (angle) component.
    fs[n] = fs[n].polar().add(new Complex(0, -phase * n)).euler();
}

This gives a nice standing sine wave instead of a rotating one. Note that I subtract an additional ½π; this centers the waveform more nicely (visually).

(I only alter the first 16 frequencies; after this, conjugates are added to complete the 32 frequency bins before applying the inverse DFT.)
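
Spelled out, that completion step would look something like this (a sketch; assuming a conj() method on the same Complex class):

for (let n = 17; n < 32; n++)
{
    // Mirror bin n onto bin 32 − n with a complex conjugate, so the spectrum
    // is conjugate-symmetric and the inverse DFT yields a purely real waveform.
    fs[n] = fs[32 - n].conj();
}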

I also tried the “Power Up!” sample, and it also sounds cleaner with this phase normalisation. I couldn’t really apply it to a longer sample though (to make a good example), since, as said, my pitch detection algorithm was quickly thrown together and does not work very well.

Sine with normalized phase, without setting frequency

In this last example I commented out the setting of the frequency just to prove there is no noise remaining.

By Grauw

Ascended (10558)

08-12-2021, 00:56

I found a longer speech sample that converted well enough that I could use it as an example:

“Alarm” sample without normalizing the phase

Here the waveform jumps around a lot.

“Alarm” sample with normalized phase

Here the waveform stands still and just reshapes itself.

And lastly, a video with a fun visualisation.

By ARTRAG

Enlighted (6828)

08-12-2021, 11:40

Very good. It confirms that the noise was related to the discontinuity between successive wave samples, and your proposal of locking the wave sample to the phase of the first harmonic already gives an audible improvement. From my understanding, the closer the wave sample is to a pure tone, the better it works.

Since the first harmonic in the wave sample should be the one with the highest energy, optimising the rotation to minimise the MSE between successive wave samples in the DFT domain is probably almost equivalent to choosing the phase of the first harmonic of the previous sample as the linear offset for the next.
This relates to your method, but I cannot fully predict the effect of using the first bin of the previous sample as the reference, compared to using the first bin of the current sample itself.
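
To spell that reasoning out: writing $W_0$ and $W_1$ for the spectra of the two waves, the error of rotating $W_1$ by $\theta$ is

$$E(\theta) = \sum_n \bigl|W_1[n]\,e^{-in\theta} - W_0[n]\bigr|^2 = \sum_n \bigl(|W_1[n]|^2 + |W_0[n]|^2\bigr) - 2\sum_n \operatorname{Re}\bigl(W_1[n]\,\overline{W_0[n]}\,e^{-in\theta}\bigr),$$

so minimising $E(\theta)$ means maximising the last sum; when the $n = 1$ bin carries most of the energy, the optimum is approximately $\theta \approx \arg W_1[1] - \arg W_0[1]$, the difference of the two fundamentals’ phases.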

It is worth a try to see what sounds best.
