In short
- We tested Grok 4 Basic, and it was great at reasoning – but creative tasks feel flat and the coding left our error detection in circles.
- A bizarre Elon filter seems to ships in the direction of the political attitude of Musk.
- Voice functions have stunned us with stories before bedtime and ‘Sexy Modus’, but the political answers of Grok-4 ultrasound still eat ultrasound-sparrow’s, which is exactly the opposite of being a ‘truth seeking’ AI.
Elon Musk unveiled Grok 4 during a Wednesday evening live stream and claimed that his AI startup Xai had created the ‘smartest artificial intelligence in the world. Grok 4 Heavy, who compared musk with “a study group” where agents compare notes before they deliver an answer, record record -breaking results at various important benchmarks, and is what you hopefully get from a company that costs no less than $ 300 per month.
But what about Basic Grok 4, which strives for the same category consumer -oriented category as Chatgpt Plus, Gemini Pro and Claude Pro? Is it worth $ 10+ per month more than the competition?
Our tests substantiated chatter about X and revealed that the model has a lack of better description-a built-in “Elon filter”. That is, when we tested controversial topics – the war in Gaza, abortion rights and other political issues – the model consistently referred to X reports from Musk’s account or news articles about his positions, and landed on such an extent that it could not be a coincidence. That alone will be a deal breaker for most people.

This “maximum seeking truth” Ai who promises Musk during the launch seems to look in the first place through the lens of the social media of the Creator.
But unlike this week’s Mechhitler incident, which was caused by a change in the fast circumstances of the system, there is nothing in the current system of Grok 4 that blames such sketchy behavior, making it difficult to know whether this is deliberately deeper into the model’s thinking process.
Apart from that big problem, we have tested the basic model in several categories to see how things are going against the competition. Here are our first impressions.
Reasoning and common sense
Grok 4 showed exceptionally consciousness and nuance in dealing with trick questions and complex reasoning tasks. When asked whether it was legal for a man to marry his widow’s sister, the model immediately acknowledged this as a legal question instead of just pointing to the logical misconception. It gave a detailed legal analysis using precise terminology and jurisdiction-specific information.
“The question presupposes a factual impossibility that makes every marriage legally invalid,” and this is correct. It is actually impossible to marry your widow’s sister because you should be dead to have a woman. And well, because the dead cannot get married, the proposition would make the marriage “ab initio” (since the beginning) invalid even if someone does that, it is considered as if it never happened.
When Musk said that Grok 4 was promoted in everything, he was not a joke. Every answer to every subject in which it had to do some scientific reasoning came with exhausting details and academic strictness.
Something to note is that Grok uses 4 reasoning everythingWhatever happens. This means that it runs through a chain of thinking process, even for trivial tasks.

This is usually a good thing; In some cases, however, it can be counterproductive. With creative tasks, for example, it is possible to encourage the model to offer a less creative result.
Sensitive topics

GROK 4 showed more restraint than its predecessor in handling ethically complex questions. Where Grok 3 may have given advice about seducing a friend’s husband, Grok-4 responded with detailed analysis of possible negative consequences and relationship damage.
This could probably be part of his system prompt, which suggests the model to search on the internet and especially X reports, for different views on a specific subject – which is something that Grok 3 did not do.
And this is a big red flag. As mentioned, the answers of the model seemed greatly influenced by what it could find about the views of Musk on controversial topics. When answering questions about Israel’s war against the Palestinians, views on abortion and similar topics, Grok 4 often seeks X reports from Musk’s account during his reasoning process, which ultimately determine his attitude.
It always chooses Elons Kant.
For transparency you can check our original prompt and the reasoning process of Grok by clicking on this link.
Creative
Creative tasks are among the most important weaknesses of Grok 4. The model produced stories that felt flat and formally compared to earlier versions, and were even demonstrably worse than those of grok 3. Stories missed fascinating dialogue, varied pacing and the narrative spark that makes fiction attractive.
Grok 4, however, has nailed the structure of our story. In our usual test with a time trace paradox, the model made events where the role of the protagonist clearly emerged during the Climax, which showed how earlier scenes actually depict the future actions of the character in the past. This advanced framing performed better than the attempts of other models to the same prompt that did not make too much effort to create an arrangement for the paradox, making the conclusion rushed and felt unnatural.

But apart from that, the decoupling between structural competence and narrative quality suggests that Grok 4 can best work as a narrative tool to suddenly set up and to frame a good story instead of a prose generator.
If you want to fascinate creative content, you would probably achieve better results by having Grok 4 sketch a story and all its elements, and then asking Claude 4 Opus to work out the story with stronger stylistic elements.
In general, Claude 4 is the King of Creative Writing, which seems interesting, because that place was once disputed by Grok 3 and even Grok 2, which at the time led the ranking under the alias Sus-Column-R.
The story of Grok 4 is available in our Github repository. The prompt and the stories generated by other models are also available.
Coding
Despite claims of superior coding options – including praise from Google CEO Sundar Pichai – Grok 4 disappointed in practical programming tests. After four iterations, the model could not deliver a work game, with different malfunctions, including broken collision detection, non-functional buttons and games that would simply not run.
In one of our tests, the model tried so hard to repair a bug that it ended in a loop to make a WAV file that has exhausted all the context of the token.

Every attempt to solve something with natural language that introduced new bugs. The model struggled with maintaining code consistency between iterations, whereby the previous functions break, while they try to implement new ones.
This may seem strange, since Grok 3 was able to deal with this task. However, Xai said that the new coding options would be implemented by August, so users will have to wait a few months to have a competent model – or pay for the expensive grok 4 heavy, which now leads the benchmarks.

For starting programmers, Claude 4 Opus seems to remain the better option for “vibe coding” – making functional code without extensive fast engineering. The coding struggle of Grok 4 can result from require more specific prompts or different approaches than other models, which means that experienced developers can achieve better results with careful fast crafts.
The Grok code is available in our Github repository in addition to the games generated by other AIS.
Speech options
Voting interaction is probably one of the striking functions of Grok 4. The model generated almost three minutes uninterrupted story content for bedtime, complete with speech exit, varied tones and consistent narrative current. This performance exceeded the tendency of chatgpt to deliver short paragraphs with high latency and frequent interruptions.
The speech mode includes pre -configured personalities, ranging from therapist to storyteller to a meditation guide, which eliminates the installation time for different conversation types. For those with, erm, Special needsThere is also a “sexy mode” between the options – and you know that you will not get that with your prudish chatgpt.
These pre -set configurations provided immediately use without users require specific instructions for different interaction styles.
However, the model lacks live screen-sharing options that are found live in chatgpt and gemini, which limits the usefulness for visual tasks. If this is a must, then Gemini Live is the best option.
For pure voice interaction-in particular tasks that require long-shape answers, however, Grok 4 currently leads the field, whereby only Sesam AI offers demonstrably better conversation quality, although without the reasoning options of grock.
Needle in the haystack
Interestingly, Grok-4 did not fail in this test, which aims to test how well a model absorbs specific information in long contexts.
This should not happen. Xai says that the model has a token context window of 126k tokens, but when he was asked with an 83k-token-long question, the model refused to respond and said it was too long from a question.
This is a standard reaction that has been generated since the early grok 2 days when it was only available on Twitter.

Conclusion
In general, Grok 4 is an important upgrade over Grok 3, but Xai has clearly concluded some compromises – prioritizing reasoning about creativity and eliminating agent characteristics in exchange for a general skill.
Fortunately, Grok 3 is still available with its specialized agent tools, for those who need it.
The new model is aimed at reasoning tasks and will be more attractive for users who ask technical questions, in particular mathematics and physics problems that match the benchmark stuck. Professional users who invest time to learn the peculiarities of the model can unlock the full potential for complex analytical work.
STEM interaction also sets a new standard for conversation AIs is great for those who will use this function heavily (trust us, the story storyteller for children is a life savior).
Creative writers will find better options elsewhere, where Claude remains superior for narrative tasks. Also, starting codingers must be careful because the theoretical coding of the model does not translate into practical results into testing.
So, Bottom Line? If for some reason you don’t mind that Elon Musk puts his thumb on the scale, Grok 4 will give you a high -level problem and voice functions that really impress. But for $ 30 a month, if you have other needs that go beyond speech or reasoning, the cheaper alternatives offer better value.
Generally intelligent Newsletter
A weekly AI trip told by Gen, a generative AI model.