.TH pdf2htmlEX 1 "pdf2htmlEX @PDF2HTMLEX_VERSION@"
.SH NAME
.PP
.nf
pdf2htmlEX \- converts PDF to HTML without losing text and format.
.fi
.SH USAGE
.PP
.nf
pdf2htmlEX [options] <input\-filename> [<output\-filename>]
.fi
.SH DESCRIPTION
.PP
pdf2htmlEX is a utility that converts PDF files to HTML files.
pdf2htmlEX tries its best to render the PDF precisely and maintain proper styling, while retaining text and optimizing for the Web.
Fonts are extracted from the PDF and then embedded into the HTML (Type 3 fonts are not supported). Text in the converted HTML file is usually selectable and copyable.
Other objects are rendered as images and also embedded.
.SH OPTIONS
.SS Pages
.TP
.B -f, --first-page <num> (Default: 1)
Specify the first page to process.
.TP
.B -l, --last-page <num> (Default: last page)
Specify the last page to process.
.SS Dimensions
.TP
.B --zoom <ratio>, --fit-width <width>, --fit-height <height>
--zoom specifies the zoom factor directly; --fit-width and --fit-height specify the maximum width and height of a page, in pixels.
If multiple values are specified, the minimum one will be used.
If none is specified, pages will be rendered at 72 DPI.
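.IP
As a rough illustration of the rule above, the following shell sketch picks the smallest candidate zoom. The helper name effective_zoom is hypothetical, and converting --fit-width and --fit-height into a ratio against the page size at 72 DPI is an assumption made for this example, not a description of the actual implementation.
.nf
```shell
# Hypothetical helper, not part of pdf2htmlEX.
# Usage: effective_zoom PAGE_W PAGE_H ZOOM FIT_W FIT_H  (0 means "not specified")
# PAGE_W and PAGE_H are assumed to be the page size in pixels at 72 DPI.
effective_zoom() {
    awk -v w="$1" -v h="$2" -v z="$3" -v fw="$4" -v fh="$5" 'BEGIN {
        best = 0
        if (z > 0) best = z
        if (fw > 0 && (best == 0 || fw / w < best)) best = fw / w
        if (fh > 0 && (best == 0 || fh / h < best)) best = fh / h
        if (best == 0) best = 1    # nothing given: assume zoom 1, i.e. 72 DPI
        print best
    }'
}
```
.fi
.IP
For a 612 by 792 page, a zoom of 1.5 combined with a fit-width of 1224 (a ratio of 2) would give 1.5 under these assumptions.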
.TP
.B --use-cropbox <0|1> (Default: 1)
Use CropBox instead of MediaBox for output.
.TP
.B --hdpi <dpi>, --vdpi <dpi> (Default: 144)
Specify the horizontal and vertical DPI for images.
.SS Output
.B --embed <string>
.br
.B --embed-css <0|1> (Default: 1)
.br
.B --embed-font <0|1> (Default: 1)
.br
.B --embed-image <0|1> (Default: 1)
.br
.B --embed-javascript <0|1> (Default: 1)
.br
.B --embed-outline <0|1> (Default: 1)
.RS
Specify which elements should be embedded into the output HTML file.
If a switch is off, separate files will be generated along with the HTML file for the corresponding elements.
--embed accepts a string as its argument. Each letter of the string must be one of `cCfFiIjJoO`, and corresponds
to one of the --embed-*** switches. A lower case letter means 0 and an upper case letter means 1. For example,
`--embed cFIJo` means to embed everything but CSS files and outlines.
.RE
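.PP
The letter mapping described above can be sketched in shell as follows. The function name expand_embed is hypothetical and exists only for this illustration; it prints the long switches that a given --embed string stands for.
.nf
```shell
# Hypothetical helper, not part of pdf2htmlEX: expand an --embed string
# into the equivalent --embed-*** switches (lower case = 0, upper case = 1).
expand_embed() {
    spec=$1
    out=
    while [ -n "$spec" ]; do
        rest=${spec#?}              # everything after the first letter
        letter=${spec%"$rest"}      # the first letter itself
        spec=$rest
        case $letter in
            c) out="$out --embed-css 0" ;;
            C) out="$out --embed-css 1" ;;
            f) out="$out --embed-font 0" ;;
            F) out="$out --embed-font 1" ;;
            i) out="$out --embed-image 0" ;;
            I) out="$out --embed-image 1" ;;
            j) out="$out --embed-javascript 0" ;;
            J) out="$out --embed-javascript 1" ;;
            o) out="$out --embed-outline 0" ;;
            O) out="$out --embed-outline 1" ;;
            *) echo "invalid --embed letter: $letter" >&2; return 1 ;;
        esac
    done
    echo "${out# }"
}
```
.fi
.PP
With this sketch, expand_embed cFIJo prints the five long switches with CSS and outline set to 0 and the rest set to 1.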
.TP
.B --split-pages <0|1> (Default: 0)
If turned on, the content of each page is stored in a separate file.
This switch is useful if you want pages to be loaded separately and dynamically; a supporting server might be necessary.
Also see --page-filename.
.TP
.B --dest-dir <dir> (Default: .)
Specify destination folder.
.TP
.B --css-filename <filename> (Default: <none>)
Specify the filename of the generated css file, if not embedded.
If it's empty, the file name will be determined automatically.
.TP
.B --page-filename <filename> (Default: <none>)
Specify the filename template for pages when --split-pages is 1.
A %d placeholder may be included in `filename` to indicate where the page number should be placed. The placeholder supports a limited subset of normal numerical placeholders, including specified width and zero padding.
If `filename` does not contain a placeholder for the page number, the page number will be inserted directly before the file extension. If the filename does not have an extension, the page number will be placed at the end of the file name.
If --page-filename is not specified, <input-filename> will be used for the output filename, replacing the extension with .page and adding the page number directly before the extension.
.br
.B Examples
.br
.B pdf2htmlEX --split-pages 1 foo.pdf
.br
Yields page files foo1.page, foo2.page, etc.
.br
.B pdf2htmlEX --split-pages 1 foo.pdf --page-filename bar.baz
.br
Yields page files bar1.baz, bar2.baz, etc.
.br
.B pdf2htmlEX --split-pages 1 foo.pdf --page-filename page%dbar.baz
.br
Yields page files page1bar.baz, page2bar.baz, etc.
.br
.B pdf2htmlEX --split-pages 1 foo.pdf --page-filename bar%03d.baz
.br
Yields page files bar001.baz, bar002.baz, etc.
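.IP
The naming rule for --page-filename can be approximated with the shell sketch below. The function name page_name is hypothetical, and the sketch only mimics the documented behaviour for simple file names without directory parts; it is not the code pdf2htmlEX uses.
.nf
```shell
# Hypothetical helper, not part of pdf2htmlEX: build a page file name
# from a --page-filename template and a page number.
page_name() {
    template=$1
    page=$2
    case $template in
        *%*d*)
            # Placeholder present (e.g. %d or %03d): let printf expand it.
            printf "$template" "$page"
            ;;
        *.*)
            # No placeholder: insert the number directly before the extension.
            printf '%s%d.%s' "${template%.*}" "$page" "${template##*.}"
            ;;
        *)
            # No placeholder and no extension: append the number.
            printf '%s%d' "$template" "$page"
            ;;
    esac
    echo
}
```
.fi
.IP
For instance, page_name bar%03d.baz 1 prints bar001.baz, and page_name bar.baz 2 prints bar2.baz.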
.TP
.B --outline-filename <filename> (Default: <none>)
Specify the filename of the generated outline file, if not embedded.
If it's empty, the file name will be determined automatically.
.TP
.B --process-nontext <0|1> (Default: 1)
Whether to process non-text objects (as images).
.TP
.B --process-outline <0|1> (Default: 1)
Whether to show the outline in the generated HTML.
.TP
.B --printing <0|1> (Default: 1)
Enable printing support. Disabling this option may reduce the size of CSS.
.TP
.B --fallback <0|1> (Default: 0)
Output in fallback mode, for better accuracy and browser compatibility, at the cost of a larger output size.
.SS Fonts
.TP
.B --embed-external-font <0|1> (Default: 1)
Specify whether locally matched fonts, used for fonts not embedded in the PDF, should be embedded into the HTML.
If this switch is off, only font names are exported so that web browsers may try to find suitable fonts themselves, which might cause incorrect font metrics.
.TP
.B --font-format <format> (Default: woff)
Specify the format of fonts extracted from the PDF file.
.TP
.B --decompose-ligature <0|1> (Default: 0)
Decompose ligatures. For example, 'fi' -> 'f' 'i'.
.TP
.B --auto-hint <0|1> (Default: 0)
If set to 1, hints will be generated for the fonts using fontforge.
--external-hint-tool, if specified, takes precedence over this option.
.TP
.B --external-hint-tool <tool> (Default: <none>)
If specified, the tool will be called to enhance hinting for fonts; this takes precedence over --auto-hint.
The tool will be called as '<tool> <in.suffix> <out.suffix>', where suffix is the same as specified for --font-format.
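.IP
To make the calling convention concrete, here is a minimal shell sketch of a tool that accepts the two arguments described above. The name noop_hint_tool is hypothetical, and the sketch only copies the font unchanged; a real tool would write a hinted font to the output path.
.nf
```shell
# Hypothetical placeholder tool, not shipped with pdf2htmlEX.
# Mirrors the documented call: <tool> <in.suffix> <out.suffix>
noop_hint_tool() {
    in_font=$1
    out_font=$2
    # A real tool would hint the font here; this sketch only copies it.
    cp "$in_font" "$out_font"
}
```
.fi
.IP
Saved as an executable script whose body forwards its two arguments in the same way, something like this could be given to --external-hint-tool to check that the tool is actually being invoked.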
.TP
.B --stretch-narrow-glyph <0|1> (Default: 0)
If set to 1, glyphs narrower than described in the PDF will be stretched; otherwise space will be padded to the right of the glyphs.
.TP
.B --squeeze-wide-glyph <0|1> (Default: 1)
If set to 1, glyphs wider than described in the PDF will be squeezed; otherwise they will be truncated.
.TP
.B --override-fstype <0|1> (Default: 0)
Clear the fstype bits in TTF/OTF fonts.
Turn this on if Internet Explorer complains about 'Permission must be Installable' AND you have permission to do so.
.TP
.B --process-type3 <0|1> (Default: 0)
If turned on, pdf2htmlEX will try to convert Type 3 fonts such that text can be rendered natively in HTML.
Otherwise all text with Type 3 fonts will be rendered as images.
This feature is highly experimental.
.SS Text
.TP
.B --heps <len>, --veps <len> (Default: 1)
Specify the maximum tolerable horizontal/vertical offset (in pixels).
pdf2htmlEX will try to optimize the generated HTML file by moving text within this distance.
.TP
.B --space-threshold <ratio> (Default: 0.125)
pdf2htmlEX will insert a whitespace character ' ' if the distance between two consecutive letters in the same line is wider than ratio * font_size.
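.IP
The comparison above can be written out as a small shell sketch. The function name needs_space is hypothetical, and the gap and the font size are assumed to be given in the same units.
.nf
```shell
# Hypothetical helper, not part of pdf2htmlEX.
# Usage: needs_space GAP FONT_SIZE RATIO
# Succeeds (exit status 0) when GAP is wider than RATIO * FONT_SIZE.
needs_space() {
    awk -v gap="$1" -v size="$2" -v ratio="$3" 'BEGIN {
        exit !(gap > ratio * size)
    }'
}
```
.fi
.IP
With the default ratio of 0.125 and a font size of 12, the threshold is 1.5, so a gap of 2 would get a space and a gap of 1 would not.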
.TP
.B --font-size-multiplier <ratio> (Default: 4.0)
Many web browsers limit the minimum font size, and many round the given font size, which results in incorrect rendering.
Specifying a ratio greater than 1 resolves this issue, but it might freeze some browsers.
In some versions of Firefox there is a problem when the font size is too large, in which case a smaller value should be specified here.
.TP
.B --space-as-offset <0|1> (Default: 0)
If set to 1, space characters will be treated as offsets, which allows better optimization.
For PDF files with bad encodings, turning on this option may cause characters to be lost.
.TP
.B --tounicode <-1|0|1> (Default: 0)
A ToUnicode map may be provided for each font in the PDF, indicating the 'meaning' of the characters. However, there is often better "ToUnicode" information in Type 0/1 fonts, and sometimes the provided ToUnicode map is wrong.
If set to 1, the ToUnicode map is always applied if provided in the PDF, and characters may not render correctly in HTML if there are collisions.
If set to -1, a customized map is used so that rendering will be correct in HTML (visually the same), but you may not get correct characters by select, copy and paste.
If set to 0, pdf2htmlEX will try its best to balance the two methods above.
.TP
.B --optimize-text <0|1> (Default: 0)
If set to 1, pdf2htmlEX will try to reduce the number of HTML elements used for text. Turn it off if anything goes wrong.
.SS Background Image
.TP
.B --bg-format <format> (Default: png)
Specify the background image format. Run `pdf2htmlEX -v` to check all supported formats.
.SS PDF Protection
.TP
.B -o, --owner-password <password>
Specify owner password
.TP
.B -u, --user-password <password>
Specify user password
.TP
.B --no-drm <0|1> (Default: 0)
Override document DRM settings.
Turn this on only when you have permission.
.SS Misc.
.TP
.B --clean-tmp <0|1> (Default: 1)
If switched off, intermediate files won't be cleaned up at the end.
.TP
.B --data-dir <dir> (Default: @CMAKE_INSTALL_PREFIX@/share/pdf2htmlEX)
Specify the folder holding the manifest and other files (see below for the manifest file).
.TP
.B --css-draw <0|1> (Default: 0)
Experimental and unsupported CSS drawing.
.TP
.B --debug <0|1> (Default: 0)
Print debug information.
.SS Meta
.TP
.B -v, --version
Print copyright and version info
.TP
.B --help
Print usage information
.SH MANIFEST and DATA-DIR
When --split-pages is 0, the manifest file describes how the final HTML page should be generated.
By default, pdf2htmlEX will use the manifest in the default data-dir (run `pdf2htmlEX -v` to check), which gives a simple demo of its syntax.
You can modify the default one, or you can create a new one and specify the correct data-dir on the command line.
When elements are embedded into the HTML file (see --embed), all files referred to by the manifest must be located in the data-dir.
.SH EXAMPLE
.TP
.B pdf2htmlEX /path/to/file.pdf
Convert file.pdf into file.html
.TP
.B pdf2htmlEX --clean-tmp 0 --debug 1 /path/to/file.pdf
Convert file.pdf and leave all intermediate files.
.TP
.B pdf2htmlEX --dest-dir out --embed-font 0 --embed-image 0 /path/to/file.pdf
Convert file.pdf into out/file.html and leave font/image files separate.
.SH COPYRIGHT
.PP
Copyright 2012,2013 Lu Wang <coolwanglu@gmail.com>
pdf2htmlEX is licensed under GPLv3 with additional terms, read LICENSE for details.
.SH AUTHOR
.PP
pdf2htmlEX is written by Lu Wang <coolwanglu@gmail.com>
.SH SEE ALSO
.TP
Home page
https://github.com/coolwanglu/pdf2htmlEX
.TP
pdf2htmlEX Wiki
https://github.com/coolwanglu/pdf2htmlEX/wiki