Evaluating AI-Generated Music

Part 2 of Research Blogs: After developing the Melody Generator, I wanted to understand how listeners interpret its generated melodies relative to those composed by humans. I therefore created a survey to assess how producers and musicians perceived melodies generated by an AI system.

Context

Evaluating creative AI depends heavily on listener perception. The difference between two melodies can be very subtle: even if they have identical note structures, listeners may perceive them differently because of variations in timing or phrasing. Evaluating AI-generated music is also difficult because the “musical quality” of a piece is typically determined both by its structural elements and by the emotional resonance it carries.

To assess this, I surveyed forty professional producers, each with at least one year of production experience. Each producer listened to short melody samples (typically five seconds long); half of the samples were created by human composers and half were generated by the AI system. After listening to each sample, the producers rated it on musicality, emotional tone, and rhythmic flow.

Technicalities

I built each pair of melodies to have similar rhythms and scales in order to eliminate variables unrelated to the composition itself. All clips were set to the same tempo and used the same synthesizer settings. Responses were collected anonymously through a web-based survey tool that randomized clip playback order to minimize bias.
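
For illustration, here is a minimal Python sketch of the kind of per-respondent randomization the survey tool performed. The clip filenames, the `playlist_for_respondent` helper, and the seeding scheme are hypothetical; only the idea of shuffling playback order while hiding the source labels reflects the actual setup.

```python
import random

# Hypothetical clip metadata: filenames plus a source label that is
# never shown to listeners. Structure and names are illustrative only.
CLIPS = [
    {"file": "melody_01_human.wav", "source": "human"},
    {"file": "melody_01_ai.wav", "source": "ai"},
    {"file": "melody_02_human.wav", "source": "human"},
    {"file": "melody_02_ai.wav", "source": "ai"},
]

def playlist_for_respondent(seed: int) -> list[str]:
    """Return a randomized playback order for one respondent.

    Each respondent hears every clip exactly once, but in their own
    shuffled order, so position effects average out across the survey.
    """
    rng = random.Random(seed)  # per-respondent seed keeps each order reproducible
    order = CLIPS.copy()
    rng.shuffle(order)
    return [clip["file"] for clip in order]  # only filenames are exposed, not labels

print(playlist_for_respondent(seed=7))
```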

The evaluation criteria included:

  • Complexity: variety of pitch and rhythm without losing cohesion
  • Emotional tone: perceived expressiveness or mood
  • Originality: distinctiveness compared to typical melodic progressions

The survey responses revealed an interesting trend. Approximately 60% of respondents reported difficulty identifying which of two melodies was composed by AI and which by a human. On average, however, melodies by human composers received higher scores for emotional tone than those generated by AI. Some respondents described the AI-generated melodies as “clean but detached.” These comments suggest that, although models can generate coherent musical structures, they still lack the intent and phrasing that humans instinctively bring to composition.
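
As a rough sketch of how such ratings can be compared, the snippet below computes mean scores per criterion for human-composed versus AI-generated clips. The column names, the example values, and the pandas-based approach are assumptions for illustration, not the actual analysis pipeline.

```python
import pandas as pd

# Hypothetical ratings table: one row per (respondent, clip), scores on a 1-5 scale.
ratings = pd.DataFrame({
    "source":         ["human", "ai", "human", "ai"],
    "complexity":     [4, 4, 3, 4],
    "emotional_tone": [5, 3, 4, 3],
    "originality":    [4, 4, 3, 3],
})

# Mean score per criterion, split by who composed the clip.
summary = ratings.groupby("source")[["complexity", "emotional_tone", "originality"]].mean()
print(summary)

# Gap between human and AI means for each criterion; a positive value
# means human-composed clips scored higher on average.
gap = summary.loc["human"] - summary.loc["ai"]
print(gap)
```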

Conclusion

The survey highlighted the narrow boundary between imitation and artistry in AI music. Listeners recognized structure but sensed something missing in expression. The findings suggest that while AI can generate coherent melodies, emotional communication still depends on human input. True collaboration emerges when humans shape what algorithms can only approximate.
