filak/hOCR-to-ALTO

hOCR-to-ALTO

Convert between Tesseract hOCR and ALTO XML 2.0/2.1/3/4 using XSL stylesheets

The stylesheets use XSLT 2.0 features, so an XSLT 2.0 capable transformer is required, e.g. Saxon.

Running the conversion from the Saxon-HE command line (this example converts ALTO to hOCR):

  1. Unpack the Saxon distribution into the saxon subdir

  2. Place your input file(s) into the _input subdir

  3. Run:

     java -jar "saxon/saxon-he-12.7.jar" -s:_input/input-alto.xml -xsl:alto__hocr.xsl -o:_output/output-hocr.xml
    

    Or use the run-saxon script from bash:

    $ ./run-saxon input-alto.xml alto__hocr.xsl output-hocr.xml
    
  4. Check the _output dir.
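The steps above handle one file at a time. For a directory of inputs, a loop along the following lines might help. This is a minimal sketch, not part of the repository: the `SAXON_JAR` variable, the `output_path` and `convert_all` function names, and the `-hocr.xml` output suffix are all assumptions chosen for illustration. It reuses the jar path, stylesheet, and directories from the example command.

```shell
#!/usr/bin/env bash
# Sketch: convert every ALTO file in _input to hOCR in _output.
# SAXON_JAR, output_path, convert_all and the "-hocr.xml" suffix are
# illustrative assumptions, not conventions defined by the repository.

SAXON_JAR="${SAXON_JAR:-saxon/saxon-he-12.7.jar}"

# Map an input path to an output path,
# e.g. _input/page1.xml becomes _output/page1-hocr.xml
output_path() {
  local base
  base="$(basename "$1" .xml)"
  printf '%s\n' "_output/${base}-hocr.xml"
}

# Run the ALTO to hOCR stylesheet over each XML file in _input.
convert_all() {
  local input
  mkdir -p _output
  for input in _input/*.xml; do
    [ -e "$input" ] || continue   # skip when the glob matched nothing
    java -jar "$SAXON_JAR" -s:"$input" -xsl:alto__hocr.xsl \
      -o:"$(output_path "$input")"
  done
}
```

After sourcing the file, calling `convert_all` from the repository root would run the conversion for each input. Swap the stylesheet name for a different conversion direction.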

See ocr-fileformat for an interface to these stylesheets.

hOCR specification: https://github.com/kba/hocr-spec

File naming scheme: sourceFormatVersion__targetFormatVersion.xsl
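Because the two format names are separated by a double underscore, a stylesheet file name can be split back into its source and target parts with plain shell parameter expansion. The helper below is a hypothetical illustration (the function name `split_stylesheet_name` is not part of the repository) and assumes the name contains exactly one `__` separator.

```shell
# Sketch: split "source__target.xsl" into its two parts.
# split_stylesheet_name is a hypothetical helper, not a repository script.
split_stylesheet_name() {
  local name="${1%.xsl}"     # drop the .xsl extension
  local src="${name%%__*}"   # text before the first double underscore
  local dst="${name#*__}"    # text after the first double underscore
  printf '%s %s\n' "$src" "$dst"
}
```

For example, `split_stylesheet_name alto__hocr.xsl` prints `alto hocr`, matching the stylesheet used in the conversion example.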

CONTENTS
