M BUZZ CRAZE NEWS
// general

Convert PDF or FDF to CSV?

By Emma Johnson

It seems crazy, but while Libre Base can export a form as PDF, there is no way to import the completed form data from the PDF into Libre Base.

One idea is to extract PDF form data to FDF (or unfriendly CSV) and then convert that to a friendly CSV at which point one can copy and paste CSV data into Libre Base.

  • How can I convert a PDF, FDF, or unfriendly CSV file to a friendly CSV with:
    • Row 1: "FieldName1, FieldName2, ..."
    • Row 2: "Value1, Value2, ..."

Useful files on Ge.tt until 2015,02,25:

  1. Super basic blank pdf with "Submit format" as FDF
    • Libre Base > Forms > Edit > Open Database Object --> File > Export as PDF
  2. Filled PDF made from (1)
    • Type data manually
  3. Extracted CSV data from (2)
    • pdftk filled_form-submit_format_fdf.pdf dump_data_fields output filled_form.csv
  4. Extracted FDF data from (2)
    • pdftk filled_form-submit_format_fdf.pdf generate_fdf output filled_form.fdf

*Note: You can also use Libre Base to export a blank form as XML or HTML, but I don't know how to input data into those files nor import that data back into Base.

11

3 Answers

How to convert FDF to CSV?

Option A:

awk -F "[()]" '{ if ($1=="/V ") value[$2];} \ END {printf( "CompanyName\tEmailAddress\t\tCountryOrRegion\n" ); \ for (x in value)printf("%s\t", x);print "" ; \ }' filled_form.fdf > filled_form.CSV

Option B:

grep -oP '(?<=\/T \(txt).*(?=\))' filled_form.fdf |awk '{ORS=(NR%3?",":RS)}1'; \
grep -oP '(?<=\/V \().*(?=\))' filled_form.fdf |awk '{ORS=(NR%3?",":RS)}1'; 

Short version of above command would be:

paste -sd, <(grep -oP '(?<=\/T \(txt).*(?=\))' filled_form.fdf) <(grep -oP '(?<=\/V \().*(?=\))' filled_form.fdf)

Option C:

awk 'NR%2==0{type[$0]} NR%2{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(grep -oP '(?<=\/T \(txt|\/V \().*(?=\))' filled_form.fdf)

How to convert unfriendly CSV to friendly CSV ?

Option A:

awk -F: '{ if ($1=="FieldValue") value[$2];} \ END {printf( "CountryOrRegion\tCompanyName\tEmailAddress\n" ); \ for (x in value)printf("%s\t", x) ;print ""; \ }' filled_form.csv > friendly_filled_form.CSV

Option B:

grep -oP '(?<=FieldName: txt).*' filled_form.csv |awk '{ORS=(NR%3?",":RS)}1'; \
grep -oP '(?<=FieldValue: ).*' filled_form.csv |awk '{ORS=(NR%3?",":RS)}1'

*Note that this command is one linear. so for running it you have to type/copy both line.

And short version for this one would be:

paste -sd, <(grep -oP '(?<=FieldName: txt).*' filled_form.csv) <(grep -oP '(?<=FieldValue: ).*' filled_form.csv)

Option C:

awk 'NR%2{type[$0]} NR%2==0{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(grep -oP '(?<=FieldName: txt|FieldValue: ).*' filled_form.csv)

or even this one:

awk 'NR%2{type[$0]} NR%2==0{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(awk -F'FieldValue: |FieldName: txt' 'NF>1{print $2}' filled_form.csv)

How to convert PDF to CSV?

I will complete the solution with pdfgrep tomorrow but if you want to try it for yourself, here is the command:

pdfgrep 'CompanyName|CountryOrRegion|EmailAddress' filled_form-submit_format_fdf.pdf

It needs to work on output format. If you want to get only whole words, use the -C 0 option. good luck and I hope helps ;)

1

Answering especially your comment here:

I wish Base could decode a PDF file that Base created

No, it is not reasonable to expect that any program should be able to read a PDF it generated, not any more than expecting a printer to be able to churn back a sheet it had just printed and giving you back your document.

PDF is a printing/archival format. It is basically(1) a series of instructions of the kind of "put the word hello at position x,y in the page". So when you see in a PDF:

example PDF

There is no way for a computer program to say if it's two rows by three columns or two rows by two columns containing spaces in it(2). The number 3 could be a 2.95 rounded up, even. You can't know because the info is simply not there. So no, you never think that going back from PDF to your data should be possible(3).

However, you can try to convert the information to things more manageable with, for example, pdftotext(4):

[romano:~/tmp] 1 % pdftotext Untitled1.pdf; cat Untitled1.txt
Sheet1
shiny
mint
new
used
3
1
Page 1

...and then use scripting, manual editing, whatever to massage it up.


Footnotes:

(1) really, really understatement here.

(2) It's 2 by 2, really.

(3) I have even seen PDFs where the creator program went long stretches to make this almost impossible, by putting each glyph independently and randomly in the page, to make tampering with it very difficult. No, I can't find the example any more...

(4) from the package poppler-utils

1

LibreOffice can produce a pdf form.

Getting the data therefrom to a CSV can be achieved using free Foxit Reader which works well under Bionic Beaver 18.04.

With a completed form loaded in Foxit Reader select "form" from the "connect" dropdown. A tool button should then offer "Export Form Data". The save window will then offer choices of XML, XFDF (whatever that is), TXT and CSV.

Your Answer

Sign up or log in

Sign up using Google Sign up using Facebook Sign up using Email and Password

Post as a guest

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy