The demand for identity-preserving facial aesthetic enhancement (FAE) has surged in social media and digital entertainment. However, existing methods based on deep generative models struggle to balance fine-grained detail enhancement with preserving the unique identities of individuals from diverse ethnic and gender backgrounds. To address this issue, this paper proposes a novel tuning-based framework that integrates prototype-based hierarchical prompt learning within a CLIP model with a StyleGAN-based inversion model. Our approach first adapts a pre-trained StyleGAN to the input face via pivotal tuning, optimizing around pivotal latent codes to minimize reconstruction distortion while retaining editability. A prototype-based hierarchical prompt learning module then learns multi-grained facial features to produce comprehensive, fine-grained facial descriptions for FAE. Specifically, we propose a prototypical similarity measure over a multi-ethnic dataset to select geometrically similar faces with high aesthetic scores as reference faces; this selection is guided by ArcFace regularization within categorized gender and ethnic groups to minimize identity loss. We further design a novel aesthetic attribute selection algorithm that derives generic fine-grained aesthetic attributes from these reference faces for detailed facial descriptions. These components work synergistically through dynamic weight modulation, which prioritizes features with high aesthetic contributions (e.g., enhancing lip fullness), while CLIP-driven optimization of the pivotal latent codes ensures semantic consistency. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques in both aesthetic quality and identity preservation, especially for out-of-domain faces.
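
To make the final step concrete, the sketch below illustrates one way the CLIP-driven optimization of a pivotal latent code with an ArcFace identity regularizer could look; it is a minimal illustration under stated assumptions, not the paper's released implementation. The `generator` (a pivot-tuned StyleGAN synthesis wrapper), the `arcface` embedder, the output value range, the aesthetic text prompt, and the loss weights are all hypothetical placeholders.

```python
# Illustrative sketch only: CLIP-guided refinement of a pivotal latent code with an
# ArcFace identity regularizer. `generator` and `arcface` are assumed callables
# supplied by the user; loss weights and step counts are placeholder values.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# CLIP image normalization constants (ViT-B/32 preprocessing).
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)


def refine_pivot(generator, arcface, w_pivot, prompt,
                 steps=200, lr=0.01, lambda_id=0.5, device="cuda"):
    """Optimize the pivotal code toward an aesthetic prompt while keeping identity."""
    clip_model, _ = clip.load("ViT-B/32", device=device)
    clip_model = clip_model.float().eval()
    clip_model.requires_grad_(False)

    text_feat = F.normalize(
        clip_model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)

    def clip_embed(img):
        # Assumes the generator outputs images in [-1, 1]; map to CLIP's input space.
        x = (img.clamp(-1, 1) + 1) / 2
        x = F.interpolate(x, size=224, mode="bilinear", align_corners=False)
        x = (x - CLIP_MEAN.to(x.device)) / CLIP_STD.to(x.device)
        return F.normalize(clip_model.encode_image(x), dim=-1)

    with torch.no_grad():
        id_src = F.normalize(arcface(generator(w_pivot)), dim=-1)  # source identity

    w = w_pivot.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        img = generator(w)
        # Semantic term: pull the edited face toward the aesthetic text prompt.
        loss_clip = 1.0 - (clip_embed(img) * text_feat).sum(dim=-1).mean()
        # Identity term: keep the ArcFace embedding close to the source face.
        loss_id = 1.0 - (F.normalize(arcface(img), dim=-1) * id_src).sum(dim=-1).mean()
        loss = loss_clip + lambda_id * loss_id

        opt.zero_grad()
        loss.backward()
        opt.step()

    return w.detach()
```

In the full framework, the single `prompt` above would correspond to the fine-grained aesthetic attributes produced by the prompt-learning module, and `lambda_id` plays the role of the identity-preservation weight adjusted by dynamic weight modulation; the exact formulation is described in the paper rather than in this sketch.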