Foreign Languages of PDFs

Holography related topics.
unixboy
Posts: 74
Joined: Thu Jan 08, 2015 3:44 am

Post by unixboy »

It's nice to see the addition of PDF collections in the forum. Holography is really an art medium for science geeks, and it depends far more on the original literature than on textbooks. It has been a long time since I downloaded the historical Lippmann, Meslin and Neuhauss publications, only to find that they are in French. As new PDFs keep coming, I am just wondering: how do you read this non-English literature?

How do you read the French literature in scanned PDFs? It is hard to get plain text out of scanned PDFs with OCR software so that I can run it through Google Translate into English. I am Chinese, and it took me years of studying English before I could read English literature. But what about the papers in Russian and French? Do you all know the languages spoken in Europe? Are there any English versions of the Lippmann, Meslin and Neuhauss literature? Thanks a lot.
Martin

Post by Martin »

unixboy wrote:Are there any English versions of the Lippmann, Meslin and Neuhauss literature?
To the best of my knowledge there are no English translations of Meslin and Neuhauss. In the early days (1891 - 1908) there were very few articles on Lippmann photography written in English. Most of the relevant publications were in French and German.
On the other hand, Bjelkhagen and others have since covered that literature extensively.
a_k
Posts: 190
Joined: Thu Jan 15, 2015 10:52 pm

Post by a_k »

It is indeed a problem, and the quality (of the scans, not the content) of quite a few of the documents is really low. Even if you know the language, they are hard to read. On the other hand, I am glad that we have them at all. From what I have heard, it was quite a pain for Martin / Colin to gather all the papers in the Lippmann collection, many of them as faxes.

I'll have a look at them; maybe it is possible to convert some into text form.

Post by unixboy »

Martin wrote:
unixboy wrote:Are there any English versions of the Lippmann, Meslin and Neuhauss literature?
To the best of my knowledge there are no English translations of Meslin and Neuhauss. In the early days (1891 - 1908) there were very few articles on Lippmann photography written in English. Most of the relevant publications were in French and German.
On the other hand, Bjelkhagen and others have since covered that literature extensively.
Thanks for the information. Yes, I came to know the names of the scientists in early Lippmann photography by reading Hans Bjelkhagen's publications and books. But I am still eager to understand the original texts, following the advice "You need to read only two groups of literature: the very latest, and the very earliest." (Quoted from Selected Papers on Three-Dimensional Displays, page XV, ISBN 0-8194-3893-6.)

Post by a_k »

I ran a test with http://holoforum.org/data/lippmann/Lipp ... e_1906.PDF

The program gs (Ghostscript) was used to convert it to .tif files, one per page.

One of the .tif files was then converted into text with the open source OCR program tesseract, which can process images in a language-specific way.

This is the process in detail:

Download the pdf -> Lippmann_Photo_interferencielle_1906.PDF

Convert the pdf to image files:
gs -SDEVICE=tiffg4 -r600x600 -sPAPERSIZE=A4 -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- Lippmann_Photo_interferencielle_1906.PDF
-> filename_0001.tif .. filename_0008.tif

Convert an image file into text in French (the language code is a separate argument after -l):
tesseract filename_0001.tif filename_0001 -l fra
-> filename_0001.txt

This is the result (which could use improvement):
filename_0001.txt
271 ACADEMIE DES SCIENCES.

J ’aurai l’honneur d’entretenir l’Académie des résultats obtenus par ces
diverses missions scientifiques. ' '

. __

OPTIQUE. — Des divers pritzcipes sur Zesqueis on peat fonia.i7er_'la pizotographie
directs des couleurs. Pfzotographia rfirecte des couleurs fondée sur la disper-
sion pnlcmatique. Note de M. G. LIPIPHIAANN-n

Pour qu’une épreuve photographique reproduise les couleurs dn Ino-
dele, deux conditions sont nécessaires :

1° La plaque sensible doit garder la trace des difiérences qui existent
entre les diverses radiations qui sont mélangées dan s un meme rayon inci-
dent; il faut, en d_’autres termes, que Ie systeme ernployé atria!)/se chaque
rayon incident: i

2° POW‘ 91119 1% liumiére it?-C_i_dentep soit _Ije_constitI_lée_apr(':S coup avec sa

The source document was of quite good quality but with a small font. Optimising the image files before processing might give better results, as might using better dictionaries. The end result certainly needs manual touch-up, which makes it quite a time-consuming process.
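
To run the OCR step over every page rather than a single one, something like the following could work. It is only a sketch: OCR_CMD, OCR_LANG, ocr_all_pages and join_pages are names made up for this example, and it assumes the page images follow the filename_%04d.tif pattern produced by the gs command above.

```shell
#!/bin/sh
# Sketch only: OCR_CMD and OCR_LANG are hypothetical variables for this example.
OCR_CMD="${OCR_CMD:-tesseract}"
OCR_LANG="${OCR_LANG:-fra}"

# OCR every page image that starts with the given prefix, one .txt per page.
ocr_all_pages() {
    prefix="$1"
    for tif in "${prefix}"_*.tif; do
        # Skip the unexpanded pattern when no page images exist.
        [ -e "$tif" ] || continue
        base="${tif%.tif}"
        # tesseract appends ".txt" to the output base name itself.
        "$OCR_CMD" "$tif" "$base" -l "$OCR_LANG"
    done
}

# Join the per-page text files into one file, in page order.
# The zero-padded page numbers keep the glob in the right order.
join_pages() {
    prefix="$1"
    cat "${prefix}"_[0-9]*.txt > "${prefix}_all.txt"
}

# Example use, after the gs step:
#   ocr_all_pages filename && join_pages filename
```

The joined file still needs the same manual touch-up as a single page; the loop only saves the typing.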

Post by unixboy »

a_k wrote:It is indeed a problem, and the quality (of the scans, not the content) of quite a few of the documents is really low. Even if you know the language, they are hard to read. On the other hand, I am glad that we have them at all. From what I have heard, it was quite a pain for Martin / Colin to gather all the papers in the Lippmann collection, many of them as faxes.

I'll have a look at them; maybe it is possible to convert some into text form.
I downloaded all of that treasured literature even though I don't understand French. The print quality is not a big issue as long as we can read the letters. In China, it is quite common for people to read or decipher ancient texts carved in stone or brick, and many people have to understand ancient Chinese when searching family tree records. Anyway, thank you very much for collecting, storing and organizing the important holography literature, and even translating it.

Post by unixboy »

What a surprise! You had already run a test while I was still typing my post. Thank you very much for the detailed instructions for the OCR processing. I will try it.

Post by a_k »

You are welcome. I just ran a test with one of the poor-quality files and the result was unreadable. Maybe there are better OCR programs around, or the recognition quality of tesseract could be improved with appropriate options. I'll research this a bit further. If you find something that we could try, please let me know.
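
One option that may be worth trying is tesseract's page segmentation mode, which tells it what layout to expect (3 is the fully automatic default, 4 assumes a single column, 6 a single uniform block of text). The sketch below just runs one page through several modes so the outputs can be compared by eye. Assumptions: the flag is spelled --psm in tesseract 4 and later and -psm in 3.x, and OCR_CMD, OCR_LANG, PSM_FLAG and try_psm_modes are hypothetical names for this example. Whether any mode actually helps on the fax-quality scans is untested.

```shell
#!/bin/sh
# Sketch only: these variables are hypothetical, introduced for this example.
OCR_CMD="${OCR_CMD:-tesseract}"
OCR_LANG="${OCR_LANG:-fra}"
# Defaults to "--psm"; set PSM_FLAG=-psm for tesseract 3.x.
PSM_FLAG="${PSM_FLAG:---psm}"

# Run one page image through several page segmentation modes.
# $1 = page image, remaining arguments = psm values to try.
try_psm_modes() {
    tif="$1"
    shift
    base="${tif%.tif}"
    for psm in "$@"; do
        # Writes <base>_psm<N>.txt for each mode N.
        "$OCR_CMD" "$tif" "${base}_psm${psm}" -l "$OCR_LANG" "$PSM_FLAG" "$psm"
    done
}

# Example use:
#   try_psm_modes filename_0001.tif 3 4 6
```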

Post by unixboy »

a_k wrote:I just ran a test with one of the poor-quality files and the result was unreadable.
This is exactly the problem I mean. It is very hard for OCR to generate plain text from those fax PDFs. I will probably need to type in the French manually for some of the really interesting papers.

Post by a_k »

For some of the papers the main problem is that the text is skewed on the page. Rotating it so that the text lines are horizontal gives a much better chance of successful OCR. But then there are other files that are difficult even to read. No matter how it is done, it is quite a task.
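
One way to try the rotation automatically is ImageMagick's -deskew operator. ImageMagick was not part of the test above, so treat it as an assumption; IM_CMD, deskew_page and rotate_page are names invented for this sketch, and the 40% threshold is only the starting value the ImageMagick documentation suggests.

```shell
#!/bin/sh
# Sketch only: IM_CMD is a hypothetical variable for this example.
# The command is "convert" in ImageMagick 6 and "magick" in ImageMagick 7.
IM_CMD="${IM_CMD:-convert}"

# Let ImageMagick estimate the skew and straighten the page.
# $1 = input image, $2 = output image
deskew_page() {
    "$IM_CMD" "$1" -deskew 40% +repage "$2"
}

# Rotate by a hand-picked angle when the automatic estimate fails.
# $1 = input image, $2 = angle in degrees, $3 = output image
rotate_page() {
    # Fill the corners exposed by the rotation with white.
    "$IM_CMD" "$1" -background white -rotate "$2" +repage "$3"
}

# Example use, before running the OCR step on the result:
#   deskew_page filename_0001.tif straight_0001.tif
#   rotate_page filename_0001.tif -1.5 straight_0001.tif
```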