CNNs can extract spatial correlations among the independent variables, while LSTMs excel at capturing temporal correlations within input sequences (https://dl.acm.org/doi/10.1145/3690771.3690774).
The input to the CNN must be arranged as a 2D array, or "feature map", with the shape (Time Steps × Features). The input layer is followed by convolutional layers, each of which slides a filter (also called a kernel or sliding window) over the feature map. With a filter size of 3, each filter looks at 3 consecutive time steps and across neighbouring features simultaneously. By doing this, the CNN captures how the independent variables interact with each other within a small local window. For example, the input may be 245 time steps × 12 features (i.e., W × L); the first Conv layer may output 245 time steps × 10 features × 32 filters (W × L × depth); and the second Conv layer may output 245 time steps × 8 features × 1 filter. The feature axis shrinks by 2 per layer (12 → 10 → 8), as expected for a width-3 filter applied without padding, while the time axis stays at 245, which implies padding along time. This reduction means the CNN has compressed the 12 features into a single-channel "8-feature representation" per time step that is trained to carry the most significant spatial information. In this way the CNN "distills" the most important spatial correlations before passing them to the LSTM.
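The shape arithmetic above can be checked with a small NumPy sketch. This is an illustration under assumptions, not the cited paper's implementation: the 3×3 kernel, the zero-padding on the time axis only, the ReLU activation, and the random weights are all assumed here, chosen because they reproduce the 245 × 12 → 245 × 10 × 32 → 245 × 8 × 1 shapes. The helper name conv2d is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(x, kernels, pad_time=1):
    """Naive 2D convolution (cross-correlation, as deep learning libraries compute it).

    x:       (time, features, in_channels)
    kernels: (k_time, k_feat, in_channels, out_channels)
    Zero-pads the time axis only; the feature axis is left unpadded (assumption).
    """
    x = np.pad(x, ((pad_time, pad_time), (0, 0), (0, 0)))
    k_t, k_f = kernels.shape[:2]
    # windows has shape (time_out, feat_out, in_channels, k_t, k_f)
    windows = sliding_window_view(x, (k_t, k_f), axis=(0, 1))
    # Sum over the kernel window and the input channels for every output filter.
    return np.einsum("tfckl,klco->tfo", windows, kernels)


rng = np.random.default_rng(0)
x = rng.standard_normal((245, 12, 1))    # 245 time steps, 12 features, 1 channel
w1 = rng.standard_normal((3, 3, 1, 32))  # first layer: 32 filters of size 3x3
w2 = rng.standard_normal((3, 3, 32, 1))  # second layer: 1 filter of size 3x3

h1 = np.maximum(conv2d(x, w1), 0.0)      # ReLU is an assumption
h2 = conv2d(h1, w2)
seq = h2[..., 0]                         # drop the channel axis for the LSTM

print(h1.shape)   # (245, 10, 32)
print(h2.shape)   # (245, 8, 1)
print(seq.shape)  # (245, 8)
```

The final array keeps one row per time step, so it can be fed to the LSTM as a sequence of 245 steps with 8 values each.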
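The LSTM can itself be stacked in multiple layers. Inside each LSTM cell there are three "gates" (Forget, Input, and Output gates). Each gate is essentially a small neural network layer: it has its own weights and a sigmoid activation, so it outputs values between 0 and 1 that control how much information is discarded from, written to, or read out of the cell state.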
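To make the gates concrete, here is a minimal NumPy sketch of one LSTM step using the standard gate equations. The hidden size of 16, the random weights, the stacking order of the gate weights, and the random 245 × 8 input standing in for the CNN output are all assumptions for illustration; lstm_step is a hypothetical helper, not a library function.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*hidden, n_in), U: (4*hidden, hidden), b: (4*hidden,)
    Rows are stacked as forget, input, candidate, output (order chosen here).
    """
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)           # forget gate: how much of the old cell state to keep
    i = sigmoid(i)           # input gate: how much new information to write
    g = np.tanh(g)           # candidate values to write
    o = sigmoid(o)           # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16                      # 8 matches the CNN output; 16 is arbitrary
seq = rng.standard_normal((245, n_in))      # stand-in for the CNN output
W = rng.standard_normal((4 * n_hidden, n_in)) * 0.1
U = rng.standard_normal((4 * n_hidden, n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x_t in seq:                             # walk through the 245 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # (16,)
```

Each gate is a single weighted sum followed by a sigmoid, which is why it can be described as a small neural network of its own.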