/*
 * text.cc
 *
 * Handling text and fonts, and related stuff
 *
 * Copyright (C) 2012 Lu Wang <coolwanglu@gmail.com>
 */

#include <algorithm>

#include "HTMLRenderer.h"

#include "util/namespace.h"
#include "util/unicode.h"

//#define HR_DEBUG(x)  (x)
#define HR_DEBUG(x)

namespace pdf2htmlEX {

using std::none_of;
using std::cerr;
using std::endl;

void HTMLRenderer::drawString(GfxState * state, const GooString * s)
{
    if(s->getLength() == 0)
        return;

    auto font = state->getFont();
    double cur_letter_space = state->getCharSpace();
    double cur_word_space = state->getWordSpace();
    double cur_horiz_scaling = state->getHorizScaling();

    bool drawChars = true;

    // Fonts with a vertical writing mode and Type 3 fonts are rendered as images.
    // There seems to be no way to display vertical writing mode fonts in HTML other than one div per character, which is too costly.
    // Type 3 fonts are still hard to show in HTML because of the font matrix.
    // Text with render mode >= 4 (the clipping modes) is handled the same way.
    if(state->getFont()
        && ( (state->getFont()->getWMode())
             || ((state->getFont()->getType() == fontType3) && (!param.process_type3))
             || (state->getRender() >= 4)
           )
      )
    {
        // We still want to go through the loop to ensure characters are added to the covered_chars array
        drawChars = false;
        //printf("%d / %d / %d\n", state->getFont()->getWMode(), (state->getFont()->getType() == fontType3), state->getRender());
    }

    // see if the line has to be closed due to state change
    check_state_change(state);
    prepare_text_line(state);

    // Now ready to output
    // get the unicodes
    const char *p = (s->toStr()).c_str();
    int len = s->getLength();

    //accumulated displacement of chars in this string, in text object space
    double dx = 0;
    double dy = 0;
    //displacement of current char, in text object space, including letter space but not word space.
    double ddx, ddy;
    //advance of current char, in glyph space
    double ax, ay;
    //origin of current char, in glyph space
    double ox, oy;

    int uLen;

    CharCode code;
    Unicode const *u = nullptr;

    HR_DEBUG(printf("HTMLRenderer::drawString:len=%d\n", len));

    while (len > 0)
    {
        auto n = font->getNextChar(p, len, &code, &u, &uLen, &ax, &ay, &ox, &oy);
        HR_DEBUG(printf("HTMLRenderer::drawString:unicode=%lc(%d)\n", u ? (wchar_t)u[0] : ' ', u ? u[0] : -1));

        if(!(equal(ox, 0) && equal(oy, 0)))
        {
            cerr << "TODO: non-zero origins" << endl;
        }
        ddx = ax * cur_font_size + cur_letter_space;
        ddy = ay * cur_font_size;

        double width = 0, height = font->getAscent();
        if (font->isCIDFont()) {
            char buf[2];
            buf[0] = (code >> 8) & 0xff;
            buf[1] = (code & 0xff);
            width = ((GfxCIDFont *)font.get())->getWidth(buf, 2);
        } else {
            width = ((Gfx8BitFont *)font.get())->getWidth(code);
        }

        if (width == 0 || height == 0) {
            //cerr << "CID: " << font->isCIDFont() << ", char:" << code << ", width:" << width << ", ax:" << ax << ", height:" << height << ", ay:" << ay << endl;
        }
        // Fall back to the advance, then to a tiny non-zero size, so that the char box passed to the tracer is never empty
        if (width == 0) {
            width = ax;
            if (width == 0) {
                width = 0.001;
            }
        }
        if (height == 0) {
            height = ay;
            if (height == 0) {
                height = 0.001;
            }
        }

        tracer.draw_char(state, dx, dy, width, height, !drawChars || inTransparencyGroup);

        bool is_space = false;
        if (n == 1 && *p == ' ')
        {
            /*
             * Per the PDF standard, word spacing applies to the single-byte code 32.
             * However, some PDFs use ' ' as a normal encoding slot,
             * so that it is mapped to other Unicode values.
             * In that case, when space_as_offset is on, we simply ignore that character.
             *
             * Checking the mapped Unicode value may or may not work:
             * there are always ugly PDF files with no useful info at all.
             */
            is_space = true;
        }

        if(is_space && (param.space_as_offset))
        {
            html_text_page.get_cur_line()->append_padding_char();
            // ignore horiz_scaling, as it has been merged into CTM
            html_text_page.get_cur_line()->append_offset((ax * cur_font_size + cur_letter_space + cur_word_space) * draw_text_scale);
        }
        else
        {
            if((param.decompose_ligature) && (uLen > 1) && none_of(u, u+uLen, is_illegal_unicode))
            {
                html_text_page.get_cur_line()->append_unicodes(u, uLen, ddx);
            }
            else
            {
                Unicode uu;
                if(cur_text_state.font_info->use_tounicode)
                {
                    uu = check_unicode(u, uLen, code, font.get());
                }
                else
                {
                    uu = unicode_from_font(code, font.get());
                }
                html_text_page.get_cur_line()->append_unicodes(&uu, 1, ddx);
                /*
                 * In PDF, word_space is appended if (n == 1 and *p == ' '),
                 * but in HTML, word_space is appended if (uu == ' ').
                 * Compensate for the difference between the two.
                 */
                int space_count = (is_space ? 1 : 0) - ((uu == ' ') ? 1 : 0);
                if(space_count != 0)
                {
                    html_text_page.get_cur_line()->append_offset(cur_word_space * draw_text_scale * space_count);
                }
            }
        }

        dx += ddx * cur_horiz_scaling;
        dy += ddy;
        if (is_space)
            dx += cur_word_space * cur_horiz_scaling;

        p += n;
        len -= n;
    }

    cur_tx += dx;
    cur_ty += dy;

    draw_tx += dx;
    draw_ty += dy;
}

bool HTMLRenderer::is_char_covered(int index)
{
    // Bind by const reference to avoid copying the container on every call
    const auto & covered = covered_text_detector.get_chars_covered();
    if (index < 0 || index >= (int)covered.size())
    {
        std::cerr << "Warning: HTMLRenderer::is_char_covered: index out of bounds: "
                << index << ", size: " << covered.size() << endl;
        return true; // Something has gone wrong; assume the char is covered so that at least something is output
    }
    return covered[index];
}

} // namespace pdf2htmlEX