The High Efficiency Video Coding (HEVC) standard provides excellent coding performance but is also very complex. In particular, the intra mode decision is very time-consuming because of the large number of available prediction modes and the flexible block partitioning scheme. In this paper, a highly parallel intra prediction algorithm for heterogeneous CPU + graphics processing unit (GPU) platforms is proposed, which accelerates the encoder dramatically. It targets high-quality high definition (HD) and ultra HD applications and uses prediction based on original samples (POS), where the reference samples are generated from original pixels rather than from reconstructed ones. This removes the dependency between neighboring blocks and makes it possible to perform intra mode prediction for all prediction blocks of a video frame concurrently. In addition, parallel-friendly cost functions are proposed that enable parallel rate-distortion optimization with no synchronization overhead. A detailed statistical analysis of both POS and the proposed GPU intra method is provided, and the coding performance of the presented prototype is evaluated on a large amount of experimental data. It is shown that the complexity of the intra mode selection on the CPU is reduced by up to 78.03%. This translates to encoding time reductions of up to 64.52% for a single-threaded encoder and up to 94.82% in combination with wavefront parallel processing. In high bitrate ranges, average rate increases of only 2.11%-4.26% and 0.80%-2.34% are observed for the proposed high-speed and high-quality configurations, respectively. Furthermore, the GPU intra method is shown to be extremely efficient in lossless coding scenarios, where up to 53.37% of the time is saved with an average bitrate increase of only 0.55% across all test cases.
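The idea behind POS can be illustrated with a minimal sketch. It is not the paper's implementation: all function names are hypothetical, only three modes (DC, vertical, horizontal) are modelled, the cost is a plain sum of absolute differences instead of the proposed cost functions, and missing neighbors are replaced by a fixed mid-gray value of 128, which is a simplifying assumption. The point it shows is that, because references are read from the original frame, the decision for one block never depends on the result of another block.

```python
# Sketch of prediction based on original samples (POS). Reference samples are
# read from the ORIGINAL frame, never from reconstructed neighbors, so every
# block can be evaluated independently (for example, one GPU work group each).
# All names are hypothetical; this is an illustration, not the paper's code.

MODES = ("DC", "VER", "HOR")  # simplified subset of the intra modes


def reference_samples(frame, x0, y0, n, default=128):
    """Return (top, left) references taken from original pixels.

    Assumption: unavailable neighbors are replaced by `default`.
    """
    top = [frame[y0 - 1][x0 + i] for i in range(n)] if y0 > 0 else [default] * n
    left = [frame[y0 + j][x0 - 1] for j in range(n)] if x0 > 0 else [default] * n
    return top, left


def predict(mode, top, left, n):
    """Build an n x n prediction block for the given mode."""
    if mode == "DC":
        dc = (sum(top) + sum(left) + n) // (2 * n)  # rounded mean of references
        return [[dc] * n for _ in range(n)]
    if mode == "VER":
        return [list(top) for _ in range(n)]        # copy the row above downwards
    if mode == "HOR":
        return [[left[j]] * n for j in range(n)]    # copy the left column rightwards
    raise ValueError(mode)


def sad(frame, x0, y0, pred, n):
    """Sum of absolute differences between original block and prediction."""
    return sum(abs(frame[y0 + j][x0 + i] - pred[j][i])
               for j in range(n) for i in range(n))


def best_mode(frame, x0, y0, n):
    """Return (mode, cost) with the lowest SAD for one block."""
    top, left = reference_samples(frame, x0, y0, n)
    costs = [(m, sad(frame, x0, y0, predict(m, top, left, n), n)) for m in MODES]
    return min(costs, key=lambda mc: mc[1])


def decide_frame(frame, n):
    """Decide every block of the frame; iteration order is irrelevant."""
    h, w = len(frame), len(frame[0])
    return {(x, y): best_mode(frame, x, y, n)
            for y in range(0, h, n) for x in range(0, w, n)}
```

Since `best_mode` only reads `frame`, the loop in `decide_frame` could be replaced by any parallel map without synchronization between blocks; in a conventional encoder this is not possible because each block would need the reconstructed samples of its neighbors.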