One of the reasons a dynamic language is slower than a static one is that it infers data types at runtime.  While the program runs, the language figures out what data type to use for each variable. This slows down the program.
A static language doesn't slow down the program because it knows the data types at compile time: they're written in the program. And if they aren't, most times the static language can infer them.
But it slows down the programmer who has to write the data types.
As if someone has to go slow regardless and the only question is who, program or programmer.
I think there might be a third option that lets both the program and the programmer go fast. Rather than get help on data types from a human or from software, to get help from hardware: to have the processor infer data types while the program runs.
There are three cases where a dynamic language must do more to handle data types. It looks like hardware can handle them all.
Dynamic data type detection
A dynamic programming language must first detect the data type of the input before it does anything with it. Like tell apart strings from numbers.
Strings are easy to detect, because they start with a double quote " character that means what will follow is a string. Telling apart an integer from a decimal needs more work.
This logic is often built in the runtime parser. Because this happens at runtime, it makes a dynamic language slower than a static one that could take as much time as it needs to parse at compile time. But although a static language wouldn't run faster if the processor had faster parsing, a dynamic one would.
The data type detection code boils down to a few if statements. There's a call to strtold to detect decimals, a call to strtoll to detect integers, and a call to strcmp to detect booleans. There's also code to tell apart numbers from the special symbols "inf" (for infinity) and "nan" (for not a number), but only because strtold and strroll depend on them in the C library. A language could do without these special symbols. 
I realize the code is not small, but this is more of a reason to make it run in a few cycles in hardware. The larger the code the bigger the opportunity to go faster.
I see two complications though. The first is that the input passed to the data type detection code comes from a reader that runs in a loop. The reader loop needs to break up strings at whitespace. This can be hard to get right and run fast, so it'd be nice to have this in hardware too.  But the bigger reason to add the reader in hardware isn't just to break strings, but to read the input fast. Reading input by calling C's getc for each character will never be as fast as the hardware looking at multiple bytes of the input in a single cycle.
The second issue is that since string input can be large there should be a way for the reader to allocate more memory by calling something like malloc. A reader data structure of a fixed-size can't store strings of arbitrary size.
How fast can this get? I didn't estimate the limit, but I measured a Lisp VM is 30x slower to read the input (+ 6 7) than evaluate it (0.057336ms to read, 0.001856ms to evaluate). So it's worth making the reader faster.
Dynamic data type upgrades
Now let's forget parsing and look at a program in a dynamic language that runs slower when evaluated. A program that adds 0.2 + 4.
Adding at runtime a number that has decimals with an integer needs the language to upgrade the integer to have decimals too. The upgrade code takes longer to execute than treating the integer ahead of time as a decimal.
Could the processor figure out the data types of this basic example?
The logic for it isn't hard. If we look at this logic in a virtual machine, the upgrade code has two data flow dependencies to read the data types of the numbers (from
Here's what to add in the processor to support dynamic data type upgrades for arithmetic:
I don't know if this could be done in only three processor cycles but by definition it would be faster if done in hardware than software.
In effect the datatypes are stored in an array. So although slower, an alternative would be to use memory instead of registers. This shouldn't be a problem since it's common for processors to offer instructions that do some calculations for software faster based on a predefined data structure. 
Another way to solve this problem, as I noticed later, may be to make the processor save the datatype in the value.
Relational operators for dynamic types
Another basic and common example that makes a dynamic program run slower is comparing two values. The language has to first check the data type of each value before it does the comparison.
Once again the logic for this isn't hard. The comparison code consists of multiple if statements, one per data type, and it can be hardwired in the processor to run in a couple of cycles. 
To support dynamic data types for relational operators:
Although the examples we saw are simple, it's easy to come up with complex examples that use other data types. What's striking about the complex examples is that many data types beyond the primitives like numbers and strings can be built on top of simpler data types that are easy to define, like lists and hash tables. Simple, dynamic typing in the processor based on an array may be enough to build complex types.
This will need a format to describe the types, but it's not particularly hard. Virtual machines do this already.
What there is to gain from processor support for dynamic types isn't only that programs will run faster. If that was all, we could just wait for quantum processors. Everything would run fast then. The bigger gain is simplicity.  If the processor can figure out data types we can throw out a lot of needless static analysis.  The processor depends on runtime information for much of its speed anyway. Add enough in the processor and we can also throw out virtual machines since their very existence is proof the processor has limitations.
Supporting dynamic types is such an obvious and promising solution that you would expect it'd be in processors by now. But no processor I know of does this. I searched for papers online and didn't find something relevant. Unless I missed something only experts know, the most likely explanation is that dynamic typing in the processor is a gap between two fields no one wanted to explore. Hardware designers aren't as interested in dynamic programming languages because they prefer hardware circuits to software. If it were the other way around, they'd be writing web apps in dynamic languages and wouldn't be as interested in hardware.
We won't know how faster dynamic programming languages can get, and how simpler it will be to write them, until processors add support for dynamic typing.
I ran this by someone who designed a processor. They said it's better to have wider operands to save the dynamic type, because putting this data outboard like the two-register idea would double the number of ports on the register file. That it would take a cycle or two to determine if the processor should use the ALU or the FPU, and then another couple of cycles to allocate the FPU because it could be in use for some other operation. And that dynamic languages run a parser at load time, not run time.
Thanks to the processor expert for the feedback.