Chapter 11. Data conversion

Table of Contents

11.1. Text data conversion tools
11.1.1. Converting a text file with iconv
11.1.2. Checking whether a file is UTF-8 with iconv
11.1.3. Converting file names with iconv
11.1.4. EOL conversion
11.1.5. TAB conversion
11.1.6. Editors with auto-conversion
11.1.7. Plain text extraction
11.1.8. Highlighting and formatting plain text data
11.2. XML data
11.2.1. Basic hints for XML
11.2.2. XML processing
11.2.3. The XML data extraction
11.3. Printable data
11.3.1. Ghostscript
11.3.2. Merge two PS or PDF files
11.3.3. Printable data utilities
11.3.4. Printing with CUPS
11.4. Type setting
11.4.1. roff typesetting
11.4.2. TeX/LaTeX
11.4.3. Pretty print a manual page
11.4.4. Creating a manual page
11.5. The mail data conversion
11.5.1. Mail data basics
11.6. Graphic data tools
11.7. Miscellaneous data conversion

This chapter describes tools and tips for converting data formats on the Debian system.

Standards-based tools are in very good shape, but support for proprietary data formats is limited.

11.1. Text data conversion tools

The following packages for text data conversion caught my eye.

Table 11.1. List of text data conversion tools

package | popcon | size | keyword | description
libc6 | V:97, I:99 | 10012 | charset | text encoding converter between locales by iconv(1) (fundamental)
recode | V:1.5, I:7 | 772 | charset+eol | text encoding converter between locales (versatile, more aliases and features)
konwert | V:0.4, I:4 | 192 | charset | text encoding converter between locales (fancy)
nkf | V:0.2, I:2 | 300 | charset | character set translator for Japanese
tcs | V:0.02, I:0.14 | 544 | charset | character set translator
unaccent | V:0.01, I:0.09 | 76 | charset | replace accented letters by their unaccented equivalent
tofrodos | V:1.1, I:7 | 48 | eol | text format converter between DOS and Unix: fromdos(1) and todos(1)
macutils | V:0.05, I:0.5 | 320 | eol | text format converter between Macintosh and Unix: frommac(1) and tomac(1)

11.1.1. Converting a text file with iconv

Tip

iconv(1) is provided as part of the libc6 package and is available on practically all systems for converting the character encoding.

You can convert the encoding of a text file with iconv(1) as follows.

$ iconv -f encoding1 -t encoding2 input.txt >output.txt

Encoding values are case insensitive and ignore "-" and "_" for matching. Supported encodings can be listed with the "iconv -l" command.
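As a minimal sketch of the command above, the following converts a small ISO-8859-1 sample to UTF-8 and dumps the resulting bytes. The scratch directory, the file names, and the sample text are assumptions made only for illustration.

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
# "café" in ISO-8859-1: the "é" is the single byte 0xE9 (octal 351).
printf 'caf\351\n' >"$tmpdir/latin1.txt"
# Convert from ISO-8859-1 to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.txt" >"$tmpdir/utf8.txt"
# Dump the bytes: "é" is now the two bytes 0xC3 0xA9.
utf8_hex=$(od -An -tx1 "$tmpdir/utf8.txt" | tr -d ' \n')
echo "$utf8_hex"
```

The single byte 0xE9 grows to two bytes in the output, which is why byte counts of converted files usually differ from the originals.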

Table 11.2. List of encoding values and their usage

encoding value | usage
ASCII | American Standard Code for Information Interchange, 7 bit code without accented characters
UTF-8 | current multilingual standard for all modern operating systems
ISO-8859-1 | old standard for western European languages, ASCII + accented characters
ISO-8859-2 | old standard for eastern European languages, ASCII + accented characters
ISO-8859-15 | old standard for western European languages, ISO-8859-1 with the euro sign
CP850 | code page 850, Microsoft DOS characters with graphics for western European languages, ISO-8859-1 variant
CP932 | code page 932, Microsoft Windows style Shift-JIS variant for Japanese
CP936 | code page 936, Microsoft Windows style GB2312, GBK or GB18030 variant for Simplified Chinese
CP949 | code page 949, Microsoft Windows style EUC-KR or Unified Hangul Code variant for Korean
CP950 | code page 950, Microsoft Windows style Big5 variant for Traditional Chinese
CP1251 | code page 1251, Microsoft Windows style encoding for the Cyrillic alphabet
CP1252 | code page 1252, Microsoft Windows style ISO-8859-15 variant for western European languages
KOI8-R | old Russian UNIX standard for the Cyrillic alphabet
ISO-2022-JP | standard encoding for Japanese email which uses only 7 bit codes
eucJP | old Japanese UNIX standard 8 bit code, completely different from Shift-JIS
Shift-JIS | JIS X 0208 Appendix 1 standard for Japanese (see CP932)

Note

Some encodings are only supported for data conversion and are not used as locale values (Section 8.3.1, “Basics of encoding”).

For character sets which fit in a single byte, such as the ASCII and ISO-8859 character sets, the character encoding means almost the same thing as the character set.

For character sets with many characters, such as JIS X 0213 for Japanese or the Universal Character Set (UCS, Unicode, ISO-10646-1) for practically all languages, there are many encoding schemes to fit them into a sequence of bytes.

For these, there is a clear distinction between the character set and the character encoding.

The term code page is used as a synonym for the character encoding table for some vendor-specific encodings.

Note

Please note that most encoding systems share the same codes with ASCII for the 7 bit characters, but there are some exceptions. If you are converting old Japanese C programs and URL data from the casually-called shift-JIS encoding format to UTF-8, use "CP932" as the encoding name instead of "shift-JIS" to get the expected results: 0x5C → "\" and 0x7E → "~". Otherwise, these are converted to the wrong characters.

Tip

recode(1) may be used too and offers more than the combined functionality of iconv(1), fromdos(1), todos(1), frommac(1), and tomac(1). For more, see "info recode".

11.1.2. Checking whether a file is UTF-8 with iconv

You can check if a text file is encoded in UTF-8 with iconv(1) by the following.

$ iconv -f utf8 -t utf8 input.txt >/dev/null || echo "non-UTF-8 found"

Tip

Use the "--verbose" option in the above example to find the first non-UTF-8 character.
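The same check can be wrapped in a small shell function for use in scripts. This is only a sketch: the function name is_utf8 and the two sample files are hypothetical and created here just to exercise it.

```shell
# is_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmpdir=$(mktemp -d)
printf 'caf\303\251\n' >"$tmpdir/good.txt"   # valid UTF-8 "café"
printf 'caf\351\n' >"$tmpdir/bad.txt"        # ISO-8859-1 byte, invalid as UTF-8

for f in "$tmpdir/good.txt" "$tmpdir/bad.txt"; do
  if is_utf8 "$f"; then
    echo "$f: UTF-8"
  else
    echo "$f: non-UTF-8 found"
  fi
done
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8.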

11.1.3. Converting file names with iconv

Here is an example script that converts the encoding of file names in a single directory from the one used by older operating systems to the modern UTF-8.

#!/bin/sh
# Encoding used by the old file names (pick a value from Table 11.2)
ENCDN=iso-8859-1
for x in *; do
  # Compute the UTF-8 form of the file name
  y=$(printf '%s' "$x" | iconv -f "$ENCDN" -t utf-8)
  # Rename only when the name actually changes; quoting keeps spaces intact
  [ "$x" = "$y" ] || mv -- "$x" "$y"
done

The "$ENCDN" variable should be set to the original encoding of the file names, using a value from Table 11.2, “List of encoding values and their usage”.

For more complicated cases, mount the filesystem (e.g. a partition on a disk drive) containing such file names with the proper encoding given as a mount(8) option (see Section 8.3.6, “Filename encoding”) and copy its entire contents to another filesystem mounted as UTF-8 with the "cp -a" command.

11.1.4. EOL conversion

The text file format, specifically the end-of-line (EOL) code, is dependent on the platform.

Table 11.3. List of EOL styles for different platforms

platform | EOL code | control | decimal | hexadecimal
Debian (unix) | LF | ^J | 10 | 0A
MSDOS and Windows | CR-LF | ^M^J | 13 10 | 0D 0A
Apple's Macintosh | CR | ^M | 13 | 0D

The EOL format conversion programs, fromdos(1), todos(1), frommac(1), and tomac(1), are quite handy. recode(1) is also useful.
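When the tofrodos or macutils packages are not at hand, tr(1) from coreutils can cover the simple cases. The following is only a sketch on made-up sample files; it rewrites every CR byte in the file, not only those at line ends, so it is not a drop-in replacement for the dedicated tools.

```shell
tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' >"$tmpdir/dos.txt"   # MSDOS style: CR-LF
printf 'one\rtwo\r' >"$tmpdir/mac.txt"       # old Macintosh style: CR

# DOS to Unix: delete every CR, leaving LF as the EOL code.
tr -d '\r' <"$tmpdir/dos.txt" >"$tmpdir/dos2unix.txt"
# Old Macintosh to Unix: turn every CR into LF.
tr '\r' '\n' <"$tmpdir/mac.txt" >"$tmpdir/mac2unix.txt"

# Both results now hold the same Unix style text.
cmp "$tmpdir/dos2unix.txt" "$tmpdir/mac2unix.txt" && echo "identical"
```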

Note

Some data on the Debian system, such as the wiki page data for the python-moinmoin package, use the MSDOS style CR-LF as the EOL code. So the above rule is just a general rule.

Note

Most editors (e.g. vim, emacs, gedit, …) can handle files with MSDOS style EOL transparently.

Tip

Using "sed -e '/\r$/!s/$/\r/'" instead of todos(1) is better when you want to unify a mix of MSDOS and Unix EOL styles into the MSDOS style (e.g., after merging 2 MSDOS style files with diff3(1)). This is because todos adds CR to all lines.
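The effect of that sed expression can be checked on a small mixed-style sample. This sketch assumes GNU sed (which understands "\r") and uses a made-up file created in a scratch directory.

```shell
tmpdir=$(mktemp -d)
# Mixed sample: the first line already ends in CR-LF, the second in plain LF.
printf 'dos line\r\nunix line\n' >"$tmpdir/mixed.txt"

# Append CR only to lines that do not already end with CR (GNU sed).
sed -e '/\r$/!s/$/\r/' "$tmpdir/mixed.txt" >"$tmpdir/dos.txt"

# Count CR bytes and lines: one CR per line means a uniform CR-LF style.
cr_count=$(tr -cd '\r' <"$tmpdir/dos.txt" | wc -c | tr -d ' ')
line_count=$(wc -l <"$tmpdir/dos.txt" | tr -d ' ')
echo "CR bytes: $cr_count, lines: $line_count"
```

If the first line had received a second CR, the CR count would exceed the line count.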

11.1.5. TAB conversion

There are a few popular specialized programs for converting tab codes.

Table 11.4. List of TAB conversion commands from bsdmainutils and coreutils packages

function | bsdmainutils | coreutils
expand tab to spaces | "col -x" | expand
unexpand tab from spaces | "col -h" | unexpand

indent(1) from the indent package completely reformats white space in C programs.

Editor programs such as vim and emacs can be used for TAB conversion, too. For example with vim, you can expand TAB with ":set expandtab" and ":%retab" command sequence. You can revert this with ":set noexpandtab" and ":%retab!" command sequence.
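For the coreutils commands in the table above, a round trip might look like the following sketch. The sample line is hypothetical and the tab stop width of 8 is just the usual default spelled out.

```shell
# Sample line with a TAB between two fields (hypothetical data).
tabbed=$(printf 'key\tvalue')
# expand(1): replace each TAB with spaces up to the next 8-column tab stop.
spaced=$(printf '%s\n' "$tabbed" | expand -t 8)
# unexpand(1) -a: turn runs of spaces that end at a tab stop back into TABs,
# not only the leading ones.
restored=$(printf '%s\n' "$spaced" | unexpand -a -t 8)
printf '[%s]\n' "$spaced"
```

Without "-a", unexpand(1) converts only the leading white space of each line.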

11.1.6. Editors with auto-conversion

Intelligent modern editors such as the vim program are quite smart and cope well with any encoding system and any file format. You should use these editors under the UTF-8 locale in a UTF-8 capable console for the best compatibility.

An old western European Unix text file, "u-file.txt", stored in the latin1 (iso-8859-1) encoding can be edited simply with vim by the following.

$ vim u-file.txt

This is possible since the auto detection mechanism of the file encoding in vim assumes the UTF-8 encoding first and, if it fails, assumes it to be latin1.

An old Polish Unix text file, "pu-file.txt", stored in the latin2 (iso-8859-2) encoding can be edited with vim by the following.

$ vim '+e ++enc=latin2 pu-file.txt'

An old Japanese unix text file, "ju-file.txt", stored in the eucJP encoding can be edited with vim by the following.

$ vim '+e ++enc=eucJP ju-file.txt'

An old Japanese MS-Windows text file, "jw-file.txt", stored in the so called shift-JIS encoding (more precisely: CP932) can be edited with vim by the following.

$ vim '+e ++enc=CP932 ++ff=dos jw-file.txt'

When a file is opened with the "++enc" and "++ff" options, ":w" in the Vim command line stores it in the original format and overwrites the original file. You can also specify the saving format and the file name in the Vim command line, e.g., ":w ++enc=utf8 new.txt".

Please refer to mbyte.txt, "multi-byte text support", in the vim on-line help and to Table 11.2, “List of encoding values and their usage”, for the locale values used with "++enc".

The emacs family of programs can perform the equivalent functions.

11.1.7. Plain text extraction

The following reads a web page into a text file. This is very useful when copying configurations off the Web or applying basic Unix text tools such as grep(1) on the web page.

$ w3m -dump http://www.remote-site.com/help-info.html >textfile

Similarly, you can extract plain text data from other formats using the following.

Table 11.5. List of tools to extract plain text data

package | popcon | size | keyword | function
w3m | V:24, I:84 | 1992 | html→text | HTML to text converter with the "w3m -dump" command
html2text | V:15, I:37 | 248 | html→text | advanced HTML to text converter (ISO 8859-1)
lynx | I:22 | 252 | html→text | HTML to text converter with the "lynx -dump" command
elinks | V:2, I:5 | 1448 | html→text | HTML to text converter with the "elinks -dump" command
links | V:3, I:9 | 1380 | html→text | HTML to text converter with the "links -dump" command
links2 | V:0.7, I:3 | 3288 | html→text | HTML to text converter with the "links2 -dump" command
antiword | V:1.3, I:2 | 796 | MSWord→text,ps | convert MSWord files to plain text or ps
catdoc | V:1.0, I:2 | 2580 | MSWord→text,TeX | convert MSWord files to plain text or TeX
pstotext | V:0.8, I:1.4 | 148 | ps/pdf→text | extract text from PostScript and PDF files
unhtml | V:0.02, I:0.14 | 76 | html→text | remove the markup tags from an HTML file
odt2txt | V:0.8, I:1.4 | 100 | odt→text | converter from OpenDocument Text to text
wpd2sxw | V:0.02, I:0.13 | 156 | WordPerfect→sxw | WordPerfect to OpenOffice.org/StarOffice writer document converter

11.1.8. Highlighting and formatting plain text data

You can highlight and format plain text data by the following.

Table 11.6. List of tools to highlight plain text data

package | popcon | size | keyword | description
vim-runtime | V:3, I:38 | 25864 | highlight | Vim MACRO to convert source code to HTML with ":source $VIMRUNTIME/syntax/html.vim"
cxref | V:0.05, I:0.4 | 1252 | c→html | converter for the C program to LaTeX and HTML (C language)
src2tex | V:0.03, I:0.2 | 1968 | highlight | convert many source codes to TeX (C language)
source-highlight | V:0.14, I:1.1 | 2164 | highlight | convert many source codes to HTML, XHTML, LaTeX, Texinfo, ANSI color escape sequences and DocBook files with highlight (C++)
highlight | V:0.2, I:1.3 | 756 | highlight | convert many source codes to HTML, XHTML, RTF, LaTeX, TeX or XSL-FO files with highlight (C++)
grc | V:0.05, I:0.12 | 164 | text→color | generic colouriser for everything (Python)
txt2html | V:0.08, I:0.5 | 296 | text→html | text to HTML converter (Perl)
markdown | V:0.07, I:0.4 | 96 | text→html | markdown text document formatter to (X)HTML (Perl)
asciidoc | V:0.15, I:1.1 | 3028 | text→any | AsciiDoc text document formatter to XML/HTML (Python)
python-docutils | V:0.4, I:3 | 5740 | text→any | ReStructured Text document formatter to XML (Python)
txt2tags | V:0.06, I:0.3 | 1028 | text→any | document conversion from text to HTML, SGML, LaTeX, man page, MoinMoin, Magic Point and PageMaker (Python)
udo | V:0.01, I:0.07 | 556 | text→any | universal document - text processing utility (C language)
stx2any | V:0.00, I:0.04 | 484 | text→any | document converter from structured plain text to other formats (m4)
rest2web | V:0.01, I:0.08 | 576 | text→html | document converter from ReStructured Text to html (Python)
aft | V:0.01, I:0.06 | 340 | text→any | "free form" document preparation system (Perl)
yodl | V:0.01, I:0.06 | 564 | text→any | pre-document language and tools to process it (C language)
sdf | V:0.01, I:0.08 | 1940 | text→any | simple document parser (Perl)
sisu | V:0.01, I:0.07 | 14384 | text→any | document structuring, publishing and search framework (Ruby)

11.2. XML data

The Extensible Markup Language (XML) is a markup language for documents containing structured information.

See introductory information at XML.COM.

11.2.1. Basic hints for XML

XML text looks somewhat like HTML. It enables us to manage multiple formats of output for a document. One easy XML system is the docbook-xsl package, which is used here.

Each XML file starts with the standard XML declaration as follows.

<?xml version="1.0" encoding="UTF-8"?>

The basic syntax for one XML element is marked up as the following.

<name attribute="value">content</name>

An XML element with empty content is marked up in the following short form.

<name attribute="value"/>

The "attribute="value"" part in the above examples is optional.

The comment section in XML is marked up as the following.

<!-- comment -->

Other than adding markup, XML requires a minor conversion of the content, using predefined entities for the following characters.

Table 11.7. List of predefined entities for XML

predefined entity | character to be converted from
&quot; | " : quote
&apos; | ' : apostrophe
&lt; | < : less-than
&gt; | > : greater-than
&amp; | & : ampersand

Caution

"<" or "&" cannot be used literally in attribute values or element content.
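The conversion to predefined entities can be sketched with sed(1). The function name xml_escape is hypothetical; the only subtle point is that "&" must be replaced first, otherwise the ampersands introduced by the other replacements would be escaped a second time.

```shell
# xml_escape: read text on stdin and replace the five special characters
# with their predefined XML entities. "&" is handled first on purpose.
xml_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' -e "s/'/\&apos;/g"
}

# Sample input containing all five characters.
escaped=$(printf '%s\n' "a < b & \"c\" > 'd'" | xml_escape)
printf '%s\n' "$escaped"
```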

Note

When SGML style user-defined entities, e.g. "&some-tag;", are used, the first definition wins over the others. The entity definition is expressed in "<!ENTITY some-tag "entity value">".

Note

As long as the XML markup is done consistently with a certain set of tag names (with the data held either as content or as an attribute value), conversion to another XML format is a trivial task using Extensible Stylesheet Language Transformations (XSLT).

11.2.2. XML processing

There are many tools available to process XML files such as the Extensible Stylesheet Language (XSL).

Basically, once you create a well-formed XML file, you can convert it to any format using Extensible Stylesheet Language Transformations (XSLT).

The Extensible Stylesheet Language for Formatting Objects (XSL-FO) is supposed to be the solution for formatting. However, the fop package is still in the Debian contrib (not main) archive. So LaTeX code is usually generated from XML using XSLT, and the LaTeX system is used to create printable files such as DVI, PostScript, and PDF.

Table 11.8. List of XML tools

package | popcon | size | keyword | description
docbook-xml | I:47 | 2488 | xml | XML document type definition (DTD) for DocBook
xsltproc | V:4, I:46 | 152 | xslt | XSLT command line processor (XML→ XML, HTML, plain text, etc.)
docbook-xsl | V:0.5, I:7 | 12792 | xml/xslt | XSL stylesheets for processing DocBook XML to various output formats with XSLT
xmlto | V:0.3, I:2 | 268 | xml/xslt | XML-to-any converter with XSLT
dblatex | V:0.2, I:2 | 7340 | xml/xslt | convert Docbook files to DVI, PostScript, PDF documents with XSLT
fop | V:0.3, I:2 | 2280 | xml/xsl-fo | convert Docbook XML files to PDF

Since XML is a subset of Standard Generalized Markup Language (SGML), it can be processed by the extensive tools available for SGML, such as Document Style Semantics and Specification Language (DSSSL).

Table 11.9. List of DSSSL tools

package | popcon | size | keyword | description
openjade | V:0.4, I:3 | 1212 | dsssl | ISO/IEC 10179:1996 standard DSSSL processor (latest)
openjade1.3 | V:0.02, I:0.14 | 2336 | dsssl | ISO/IEC 10179:1996 standard DSSSL processor (1.3.x series)
jade | V:0.3, I:2 | 1056 | dsssl | James Clark's original DSSSL processor (1.2.x series)
docbook-dsssl | V:0.5, I:4 | 3100 | xml/dsssl | DSSSL stylesheets for processing DocBook XML to various output formats with DSSSL
docbook-utils | V:0.2, I:2 | 440 | xml/dsssl | utilities for DocBook files including conversion to other formats (HTML, RTF, PS, man, PDF) with docbook2* commands with DSSSL
sgml2x | V:0.00, I:0.06 | 216 | SGML/dsssl | converter from SGML and XML using DSSSL stylesheets

Tip

GNOME's yelp is sometimes handy to read DocBook XML files directly since it renders decently on X.

11.2.3. The XML data extraction

You can extract HTML or XML data from other formats using the following tools.

Table 11.10. List of XML data extraction tools

package | popcon | size | keyword | description
wv | V:1.3, I:2 | 2116 | MSWord→any | document converter from Microsoft Word to HTML, LaTeX, etc.
texi2html | V:0.3, I:2 | 2076 | texi→html | converter from Texinfo to HTML
man2html | V:0.2, I:1.2 | 372 | manpage→html | converter from manpage to HTML (CGI support)
tex4ht | V:0.3, I:2 | 924 | tex↔html | converter between (La)TeX and HTML
xlhtml | V:0.5, I:1.1 | 184 | MSExcel→html | converter from MSExcel .xls to HTML
ppthtml | V:0.5, I:1.1 | 120 | MSPowerPoint→html | converter from MSPowerPoint to HTML
unrtf | V:0.4, I:0.9 | 224 | rtf→html | document converter from RTF to HTML, etc
info2www | V:0.6, I:1.2 | 156 | info→html | converter from GNU info to HTML (CGI support)
ooo2dbk | V:0.03, I:0.16 | 941 | sxw→xml | converter from OpenOffice.org SXW documents to DocBook XML
wp2x | V:0.01, I:0.07 | 240 | WordPerfect→any | WordPerfect 5.0 and 5.1 files to TeX, LaTeX, troff, GML and HTML
doclifter | V:0.00, I:0.03 | 424 | troff→xml | converter from troff to DocBook XML

For non-XML HTML files, you can convert them to XHTML which is an instance of well formed XML. XHTML can be processed by XML tools.

Table 11.11. List of XML pretty print tools

package | popcon | size | keyword | description
libxml2-utils | V:3, I:49 | 160 | xml↔html↔xhtml | command line XML tool with xmllint(1) (syntax check, reformat, lint, …)
tidy | V:1.0, I:9 | 108 | xml↔html↔xhtml | HTML syntax checker and reformatter

Once proper XML is generated, you can use XSLT technology to extract data based on the mark-up context, etc.
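As a rough illustration of such an extraction, the sketch below runs xsltproc(1) with a tiny stylesheet that prints one text line per element. The XML sample, the element and attribute names, and the file names are all made up for this example; nothing here is DocBook-specific.

```shell
tmpdir=$(mktemp -d)

# Sample XML input (hypothetical data, not a DocBook file).
cat >"$tmpdir/doc.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<packages>
  <package name="xsltproc">XSLT command line processor</package>
  <package name="tidy">HTML syntax checker and reformatter</package>
</packages>
EOF

# Stylesheet that prints one "name: description" line per package element.
cat >"$tmpdir/list.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/packages">
    <xsl:for-each select="package">
      <xsl:value-of select="@name"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# Run the transformation only if xsltproc is installed.
if command -v xsltproc >/dev/null 2>&1; then
  listing=$(xsltproc "$tmpdir/list.xsl" "$tmpdir/doc.xml")
  printf '%s\n' "$listing"
else
  echo "xsltproc not found; install the xsltproc package to run this sketch"
fi
```

The stylesheet uses the text output method, so the result is plain text that ordinary Unix text tools can process further.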

11.3. Printable data

Printable data is expressed in the PostScript format on the Debian system. Common Unix Printing System (CUPS) uses Ghostscript as its rasterizer backend program for non-PostScript printers.

11.3.1. Ghostscript

The core of printable data manipulation is the Ghostscript PostScript (PS) interpreter, which generates raster images.

The latest upstream Ghostscript from Artifex was re-licensed from AFPL to GPL, and the 8.60 release merged all the latest ESP version changes, such as the CUPS related ones, into a unified release.

Table 11.12. List of Ghostscript PostScript interpreters

package | popcon | size | description
ghostscript | V:18, I:56 | 6716 | The GPL Ghostscript PostScript/PDF interpreter
ghostscript-x | V:13, I:28 | 220 | GPL Ghostscript PostScript/PDF interpreter - X display support
gs-cjk-resource | V:0.04, I:0.4 | 4528 | resource files for gs-cjk, Ghostscript CJK-TrueType extension
cmap-adobe-cns1 | V:0.03, I:0.3 | 1572 | CMaps for Adobe-CNS1 (for traditional Chinese support)
cmap-adobe-gb1 | V:0.03, I:0.3 | 1552 | CMaps for Adobe-GB1 (for simplified Chinese support)
cmap-adobe-japan1 | V:0.08, I:0.7 | 2428 | CMaps for Adobe-Japan1 (for Japanese standard support)
cmap-adobe-japan2 | I:0.4 | 416 | CMaps for Adobe-Japan2 (for Japanese extra support)
cmap-adobe-korea1 | V:0.01, I:0.19 | 872 | CMaps for Adobe-Korea1 (for Korean support)
libpoppler5 | V:4, I:21 | 2368 | PDF rendering library based on the xpdf PDF viewer
libpoppler-glib4 | V:7, I:19 | 504 | PDF rendering library (GLib-based shared library)
poppler-data | I:3 | 12232 | CMaps for PDF rendering library (for CJK support: Adobe-*)

Tip

"gs -h" can display the configuration of Ghostscript.

11.3.2. Merge two PS or PDF files

You can merge two PostScript (PS) or Portable Document Format (PDF) files using gs(1) of Ghostscript.

$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pswrite -sOutputFile=bla.ps -f foo1.ps foo2.ps
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=bla.pdf -f foo1.pdf foo2.pdf

Note

PDF, which is a widely used cross-platform printable data format, is essentially the compressed PS format with a few additional features and extensions.

Tip

For command line, psmerge(1) and other commands from the psutils package are useful for manipulating PostScript documents. Commands in the pdfjam package work similarly for manipulating PDF documents. pdftk(1) from the pdftk package is useful for manipulating PDF documents, too.

11.3.3. Printable data utilities

The following packages for printable data utilities caught my eye.

Table 11.13. List of printable data utilities

package | popcon | size | keyword | description
poppler-utils | V:8, I:49 | 536 | pdf→ps,text,… | PDF utilities: pdftops, pdfinfo, pdfimages, pdftotext, pdffonts
psutils | V:3, I:21 | 380 | ps→ps | PostScript document conversion tools
poster | V:1.2, I:9 | 80 | ps→ps | create large posters out of PostScript pages
xpdf-utils | V:0.9, I:4 | 76 | pdf→ps,text,… | PDF utilities: pdftops, pdfinfo, pdfimages, pdftotext, pdffonts
enscript | V:1.6, I:14 | 2464 | text→ps, html, rtf | convert ASCII text to PostScript, HTML, RTF or Pretty-Print
a2ps | V:1.7, I:8 | 4292 | text→ps | 'Anything to PostScript' converter and pretty-printer
pdftk | V:1.0, I:5 | 200 | pdf→pdf | PDF document conversion tool: pdftk
mpage | V:0.18, I:1.5 | 224 | text,ps→ps | print multiple pages per sheet
html2ps | V:0.2, I:1.7 | 260 | html→ps | converter from HTML to PostScript
pdfjam | V:0.2, I:1.8 | 228 | pdf→pdf | PDF document conversion tools: pdf90, pdfjoin, and pdfnup
gnuhtml2latex | V:0.07, I:0.6 | 60 | html→latex | converter from html to latex
latex2rtf | V:0.14, I:1.0 | 508 | latex→rtf | convert documents from LaTeX to RTF which can be read by MS Word
ps2eps | V:1.3, I:12 | 116 | ps→eps | converter from PostScript to EPS (Encapsulated PostScript)
e2ps | V:0.01, I:0.10 | 188 | text→ps | Text to PostScript converter with Japanese encoding support
impose+ | V:0.03, I:0.2 | 180 | ps→ps | PostScript utilities
trueprint | V:0.02, I:0.13 | 188 | text→ps | pretty print many source codes (C, C++, Java, Pascal, Perl, Pike, Sh, and Verilog) to PostScript (C language)
pdf2svg | V:0.10, I:0.5 | 60 | ps→svg | converter from PDF to Scalable vector graphics format
pdftoipe | V:0.02, I:0.16 | 88 | ps→ipe | converter from PDF to IPE's XML format

11.3.4. Printing with CUPS

Both the lp(1) and lpr(1) commands offered by the Common Unix Printing System (CUPS) provide options for customized printing of printable data.

You can print 3 copies of a file collated using one of the following commands.

$ lp -n 3 -o Collate=True filename
$ lpr -#3 -o Collate=True filename

You can further customize printer operation by using printer option such as "-o number-up=2", "-o page-set=even", "-o page-set=odd", "-o scaling=200", "-o natural-scaling=200", etc., documented at Command-Line Printing and Options.

11.4. Type setting

The Unix troff program originally developed by AT&T can be used for simple typesetting. It is usually used to create manpages.

TeX, created by Donald Knuth, is a very powerful typesetting tool and is the de facto standard. LaTeX, originally written by Leslie Lamport, enables high-level access to the power of TeX.

Table 11.14. List of type setting tools

package | popcon | size | keyword | description
texlive | V:0.5, I:9 | 124 | (La)TeX | TeX system for typesetting, previewing and printing
groff | V:0.9, I:7 | 9116 | troff | GNU troff text-formatting system

11.4.1. roff typesetting

Traditionally, roff is the main Unix text processing system. See roff(7), groff(7), groff(1), grotty(1), troff(1), groff_mdoc(7), groff_man(7), groff_ms(7), groff_me(7), groff_mm(7), and "info groff".

You can read or print a good tutorial and reference on "-me" macro in "/usr/share/doc/groff/" by installing the groff package.

Tip

"groff -Tascii -me -" produces plain text output with ANSI escape code. If you wish to get manpage like output with many "^H" and "_", use "GROFF_NO_SGR=1 groff -Tascii -me -" instead.

Tip

To remove "^H" and "_" from a text file generated by groff, filter it by "col -b -x".

11.4.2. TeX/LaTeX

The TeX Live software distribution offers a complete TeX system. The texlive metapackage provides a decent selection of the TeX Live packages which should suffice for the most common tasks.

There are many references available for TeX and LaTeX.

  • The teTeX HOWTO: The Linux-teTeX Local Guide
  • tex(1)
  • latex(1)
  • "The TeXbook", by Donald E. Knuth, (Addison-Wesley)
  • "LaTeX - A Document Preparation System", by Leslie Lamport, (Addison-Wesley)
  • "The LaTeX Companion", by Goossens, Mittelbach, Samarin, (Addison-Wesley)

This is the most powerful typesetting environment. Many SGML processors use this as their back end text processor. LyX provided by the lyx package and GNU TeXmacs provided by the texmacs package offer a nice WYSIWYG editing environment for LaTeX, while many use Emacs and Vim as the source editor.

There are many online resources available.

When documents become bigger, sometimes TeX may cause errors. You must increase pool size in "/etc/texmf/texmf.cnf" (or more appropriately edit "/etc/texmf/texmf.d/95NonPath" and run update-texmf(8)) to fix this.

Note

The TeX source of "The TeXbook" is available at http://tug.ctan.org/tex-archive/systems/knuth/dist/tex/texbook.tex.

This file contains most of the required macros. I heard that you can process this document with tex(1) after commenting lines 7 to 10 and adding "\input manmac \proofmodefalse". It's strongly recommended to buy this book (and all other books from Donald E. Knuth) instead of using the online version but the source is a great example of TeX input!

11.4.3. Pretty print a manual page

You can print a manual page in PostScript nicely by one of the following commands.

$ man -Tps some_manpage | lpr
$ man -Tps some_manpage | mpage -2 | lpr

The second example prints 2 pages on one sheet.

11.4.4. Creating a manual page

Although writing a manual page (manpage) in the plain troff format is possible, there are a few helper packages to create it.

Table 11.15. List of packages to help creating the manpage

package | popcon | size | keyword | description
docbook-to-man | V:0.3, I:2 | 240 | SGML→manpage | converter from DocBook SGML into roff man macros
help2man | V:0.13, I:1.1 | 376 | text→manpage | automatic manpage generator from --help
info2man | V:0.02, I:0.15 | 204 | info→manpage | converter from GNU info to POD or man pages
txt2man | V:0.02, I:0.2 | 88 | text→manpage | convert flat ASCII text to man page format

11.5. The mail data conversion

The following packages for mail data conversion caught my eye.

Table 11.16. List of packages to help mail data conversion

package | popcon | size | keyword | description
sharutils | V:2, I:32 | 904 | mail | shar(1), unshar(1), uuencode(1), uudecode(1)
mpack | V:1.5, I:23 | 84 | MIME | encoder and decoder of MIME messages: mpack(1) and munpack(1)
tnef | V:0.8, I:1.5 | 164 | ms-tnef | unpacking MIME attachments of type "application/ms-tnef" which is a Microsoft only format
uudeview | V:0.17, I:1.6 | 132 | mail | encoder and decoder for the following formats: uuencode, xxencode, BASE64, quoted printable, and BinHex
readpst | V:0.04, I:0.3 | 228 | PST | convert Microsoft Outlook PST files to mbox format

Tip

The Internet Message Access Protocol version 4 (IMAP4) server (see Section 6.7, “POP3/IMAP4 server”) may be used to move mail out of proprietary mail systems if the mail client software can be configured to use an IMAP4 server too.

11.5.1. Mail data basics

Mail (SMTP) data should be limited to 7 bit. So binary data and 8 bit text data are encoded into 7 bit format with the Multipurpose Internet Mail Extensions (MIME) and the selection of the charset (see Section 8.3.1, “Basics of encoding”).

The standard mail storage format is mbox formatted according to RFC2822 (updated RFC822). See mbox(5) (provided by the mutt package).

For European languages, "Content-Transfer-Encoding: quoted-printable" with the ISO-8859-1 charset is usually used for mail since there are not many 8 bit characters. If European text is encoded in UTF-8, "Content-Transfer-Encoding: quoted-printable" is likely to be used since it is mostly 7 bit data.

For Japanese, traditionally "Content-Type: text/plain; charset=ISO-2022-JP" is usually used for mail to keep text in 7 bits. But older Microsoft systems may send mail data in Shift-JIS without proper declaration. If Japanese text is encoded in UTF-8, Base64 is likely to be used since it contains much 8 bit data. The situation for other Asian languages is similar.
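The Base64 case can be tried out with base64(1) from coreutils. This is only a sketch of the transfer encoding step on a made-up message body; it does not build MIME headers or a complete mail message.

```shell
# Sample UTF-8 text (hypothetical message body); every character is 8 bit data.
body='日本語のメール'
# Encode as Base64 so that only 7 bit ASCII characters remain.
encoded=$(printf '%s' "$body" | base64)
# Decode again to recover the original 8 bit data.
decoded=$(printf '%s' "$encoded" | base64 -d)
printf '%s\n' "$encoded"
```

The encoded form uses only letters, digits, "+", "/" and "=", at the cost of being about one third longer than the original bytes.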

Note

If your non-Unix mail data is accessible by non-Debian client software which can talk to an IMAP4 server, you may be able to move it out by running your own IMAP4 server (see Section 6.7, “POP3/IMAP4 server”).

Note

If you use other mail storage formats, moving them to the mbox format is a good first step. A versatile client program such as mutt(1) may be handy for this.

You can split mailbox contents to each message using procmail(1) and formail(1).

Each mail message can be unpacked using munpack(1) from the mpack package (or other specialized tools) to obtain the MIME encoded contents.

11.6. Ferramentas de dados gráficos

The following packages for the graphic data conversion, editing, and organization tools caught my eyes.

Tabela 11.17. Lista de ferramentas de dados gráficos

package popcon size keyword description
gimp * V:12, I:44 13560 image(bitmap) GNU Image Manipulation Program
imagemagick * V:13, I:35 268 image(bitmap) image manipulation programs
graphicsmagick * V:1.6, I:3 4532 image(bitmap) image manipulation programs (fork of imagemagick)
xsane * V:5, I:36 748 image(bitmap) GTK+-based X11 frontend for SANE (Scanner Access Now Easy)
netpbm * V:4, I:29 4612 image(bitmap) graphics conversion tools
icoutils * V:0.3, I:1.3 200 png↔ico(bitmap) convert MS Windows icons and cursors to and from PNG formats (favicon.ico)
scribus * V:0.5, I:3 26888 ps/pdf/SVG/… Scribus DTP editor
openoffice.org-draw * V:18, I:40 10720 image(vector) OpenOffice.org office suite - drawing
inkscape * V:15, I:32 87436 image(vector) SVG (Scalable Vector Graphics) editor
dia-gnome * V:1.4, I:2 576 image(vector) diagram editor (GNOME)
dia * V:3, I:5 572 image(vector) diagram editor (Gtk)
xfig * V:2, I:4 1676 image(vector) facility for interactive generation of figures under X11
pstoedit * V:1.9, I:16 708 ps/pdf→image(vector) PostScript and PDF files to editable vector graphics converter (SVG)
libwmf-bin * V:1.4, I:13 68 Windows/image(vector) Windows metafile (vector graphic data) conversion tools
fig2sxd * V:0.03, I:0.2 200 fig→sxd(vector) convert XFig files to OpenOffice.org Draw format
unpaper * V:0.2, I:1.7 736 image→image post-processing tool for scanned pages for OCR
tesseract-ocr * V:0.7, I:3 3196 image→text free OCR software based on HP's commercial OCR engine
tesseract-ocr-eng * V:0.2, I:2 1752 image→text OCR engine data: tesseract-ocr language files for English text
gocr * V:0.8, I:5 492 image→text free OCR software
ocrad * V:0.4, I:4 364 image→text free OCR software
gtkam * V:0.3, I:1.7 1100 image(Exif) manipulate digital camera photo files (GNOME) - GUI
gphoto2 * V:0.3, I:2 1008 image(Exif) manipulate digital camera photo files (GNOME) - command line
kamera * V:0.7, I:13 312 image(Exif) manipulate digital camera photo files (KDE)
jhead * V:0.5, I:3 132 image(Exif) manipulate the non-image part of Exif compliant JPEG (digital camera photo) files
exif * V:0.2, I:1.7 184 image(Exif) command-line utility to show EXIF information in JPEG files
exiftags * V:0.14, I:0.9 248 image(Exif) utility to read Exif tags from a digital camera JPEG file
exiftran * V:0.4, I:3 56 image(Exif) transform digital camera JPEG images
exifprobe * V:0.08, I:0.5 484 image(Exif) read metadata from digital pictures
dcraw * V:0.9, I:5 444 image(Raw)→ppm decode raw digital camera images
findimagedupes * V:0.06, I:0.4 140 image→fingerprint find visually similar or duplicate images
ale * V:0.02, I:0.17 768 image→image merge images to increase fidelity or create mosaics
imageindex * V:0.03, I:0.2 192 image(Exif)→html generate static HTML galleries from images
f-spot * V:0.5, I:1.8 9488 image(Exif) personal photo management application (GNOME)
bins * V:0.02, I:0.15 2008 image(Exif)→html generate static HTML photo albums using XML and EXIF tags
gallery2 * V:0.2, I:0.4 62548 image(Exif)→html generate browsable HTML photo albums with thumbnails
outguess * V:0.02, I:0.14 252 jpeg,png universal steganographic tool
qcad * V:1.5, I:2 3944 DXF CAD data editor (KDE)
blender * V:0.5, I:3 28336 blend, TIFF, VRML, … 3D content editor for animation etc
mm3d * V:0.04, I:0.3 4536 ms3d, obj, dxf, … OpenGL based 3D model editor
open-font-design-toolkit * I:0.03 36 ttf, ps, … metapackage for open font design
fontforge * V:0.2, I:1.7 6612 ttf, ps, … font editor for PS, TrueType and OpenType fonts
xgridfit * V:0.01, I:0.07 1060 ttf program for gridfitting and hinting TrueType fonts
gbdfed * V:0.01, I:0.11 496 bdf editor for BDF fonts

[Tip] Tip

Search for more image tools using the regex "~Gworks-with::image" in aptitude(8) (see Section 2.2.6, “Search method options with aptitude”).
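From the command line, the same debtag search can be sketched as follows. The helper name search_by_debtag is hypothetical; only the "~G" search term and the aptitude "search" command with its -F format option come from aptitude itself.

```shell
# Sketch: list packages carrying a given debtag with aptitude(8).
# search_by_debtag is an illustrative helper name, not part of aptitude.
search_by_debtag() {
    tag=$1    # for example: works-with::image
    # ~G<tag> is the aptitude search term for debtags;
    # -F selects the columns: %p = package name, %d = short description.
    aptitude search -F '%p %d' "~G$tag"
}
```

For example, "search_by_debtag works-with::image" lists the packages tagged as working with images.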

Although GUI programs such as gimp(1) are very powerful, command line tools such as imagemagick(1) are quite useful for automating image manipulation with scripts.

The de facto image file format of digital cameras is the Exchangeable Image File Format (EXIF), which is the JPEG image file format with additional metadata tags. It can hold information such as date, time, and camera settings.

The Lempel-Ziv-Welch (LZW) lossless data compression patent has expired. Graphics Interchange Format (GIF) utilities which use the LZW compression method are now freely available on the Debian system.
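As an illustration of such automation, the following sketch creates half-size JPEG copies of every PNG file in a directory with convert(1) from the imagemagick package. The function name, the 50% scale, and the "thumb-" prefix are arbitrary choices for this example; newer ImageMagick releases may prefer the magick command over convert.

```shell
# Sketch: batch-create JPEG thumbnails with ImageMagick's convert(1).
# make_thumbnails, the 50% scale and the "thumb-" prefix are illustrative.
make_thumbnails() {
    srcdir=$1     # directory holding the PNG files
    destdir=$2    # directory that receives the JPEG thumbnails
    mkdir -p "$destdir" || return 1
    for img in "$srcdir"/*.png; do
        [ -f "$img" ] || continue            # skip when the glob matches nothing
        name=$(basename "$img" .png)
        # the output format is chosen from the .jpg file name extension
        convert "$img" -resize 50% "$destdir/thumb-$name.jpg" || return 1
    done
}
```

For example, "make_thumbnails scans thumbs" converts scans/page1.png into thumbs/thumb-page1.jpg.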

[Tip] Tip

Any digital camera or scanner with removable recording media works with Linux through USB storage readers, since such media follow the Design rule for Camera File system (DCF) and use the FAT filesystem. See Section 10.1.10, “Removable storage device”.
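Since such media keep their pictures under a top-level DCIM directory, copying the photos off an already mounted card can be sketched as below. The function name import_photos is hypothetical, and the mount point has to be supplied by the caller because it depends on the system.

```shell
# Sketch: copy JPEG files from a mounted camera card into a local directory.
# import_photos is an illustrative name; the card must already be mounted.
import_photos() {
    mountpoint=$1    # where the card is mounted
    dest=$2          # local directory that receives the photos
    if [ ! -d "$mountpoint/DCIM" ]; then
        echo "no DCIM directory under $mountpoint" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -iname matches .JPG and .jpg alike; cp -p keeps the file timestamps.
    # Files with the same name in different DCIM subdirectories overwrite
    # each other in this simple sketch.
    find "$mountpoint/DCIM" -type f -iname '*.jpg' -exec cp -p {} "$dest"/ \;
}
```

For example, "import_photos /media/usb0 photos" copies the pictures into ./photos, where /media/usb0 is only an assumed mount point.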

11.7. Miscellaneous data conversion

There are many other programs for converting data. The following packages caught my eye using the regex "~Guse::converting" in aptitude(8) (see Section 2.2.6, “Search method options with aptitude”).

Table 11.18. List of miscellaneous data conversion tools

package popcon size keyword description
alien * V:1.2, I:11 244 rpm/tgz→deb converter of foreign packages into Debian packages
freepwing * V:0.00, I:0.03 568 EB→EPWING converter from "Electric Book" (popular in Japan) to a single JIS X 4081 format (a subset of the EPWING V1)

You can also extract data from RPM format with the following.

$ rpm2cpio file.src.rpm | cpio --extract