Previous changeset 1:b07fd3d7ffd0 (2022-06-16) Next changeset 3:1e2a13bcb5a7 (2023-04-03) |
Commit message:
planemo upload for repository https://github.com/RECETOX/galaxytools/tree/master/tools/recetox_aplcms commit 506df2aef355b3791567283e1a175914f06b405a |
modified:
macros.xml recetox_aplcms_align_features.xml utils.R |
added:
help.xml images/scheme.png mzml_id_getter.py |
removed:
macros_split.xml |
diff -r b07fd3d7ffd0 -r abe783e0daca help.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/help.xml Mon Feb 13 10:26:59 2023 +0000 |
b'@@ -0,0 +1,255 @@\n+<macros>\n+\n+<token name="@GENERAL_HELP@">\n+General Information\n+===================\n+\n+Overview\n+--------\n+ \n+recetox-aplcms is a software package for peak detection in high resolution mass spectrometry (HRMS) data.\n+It supports reading .mzml files in raw profile mode and uses a bi-Gaussian chromatographic peak shape for feature detection and quantification.\n+\n+recetox-aplcms is based on the apLCMS package developed by Tianwei Yu at Emory University - see the citations and the apLCMS section beneath.\n+This version includes various software updates and is actively developed and maintained on `GitHub`_.\n+Please submit eventual bug reports as `issues`_ on the repository.\n+\n+.. _GitHub: https://github.com/RECETOX/recetox-aplcms\n+.. _issues: https://github.com/RECETOX/recetox-aplcms/issues/new\n+\n+\n+Workflow\n+--------\n+ \n+.. image:: https://raw.githubusercontent.com/RECETOX/galaxytools/aee0dd6cf6c05936269efe4337c50e27cc68e86b/tools/recetox_aplcms/images/scheme.png\n+ :width: 2560\n+ :height: 788\n+ :scale: 40\n+ :alt: A picture of a workflow diagram.\n+\n+The individual steps of the recetox-aplcms package can be combined in 2 separate workflows processing HRMS data in an unsupervised manner or by including a-priori knowledge.\n+The workflows consist of the following building blocks:\n+\n+(1) remove noise - denoise the raw data and extract the EIC\n+(2) generate feature table - group features in EIC into peaks using peak-shape model\n+(3) compute clusters - compute mz and rt clusters across samples\n+(4) compute template - find the template for rt correction\n+(5) correct time - correct the rt across samples using splines\n+(6) align features - align identical features across samples\n+(7) recover weaker signals - recover missed features in samples based on the aligned features\n+(8) merge known table - add known features to detected features table and vice versa\n+\n+For detailed documentation on the individual steps please 
see the individual tool wrappers.\n+\n+\n+apLCMS (Original Reference)\n+---------------------------\n+ \n+apLCMS is a software which generates a feature table from a batch of LC/MS spectra. The m/z and retention time\n+tolerance levels are estimated from the data. A run-filter is used to detect peaks and remove noise.\n+Non-parametric statistical methods are used to find-tune peak selection and grouping. After retention time\n+correction, a feature table is generated by aligning peaks across spectra. For further information on apLCMS\n+please refer to https://mypage.cuhk.edu.cn/academics/yutianwei/apLCMS/.\n+</token>\n+\n+<token name="@REMOVE_NOISE_HELP@">\n+recetox-aplcms - remove noise\n+=============================\n+\n+This tool is the first step of recetox-aplcms.\n+It removes noise from the raw data and performs a first clustering step of points with close m/z values into the extracted ion chromatograms (EICs).\n+Only peaks with a minimum elution length of `min_run` seconds are kept.\n+\n+Example Output\n+--------------\n+The raw data points contained in the scans of the `mzml` file are filtered for noise and grouped into clusters based on m/z values.\n+See an example output in the table below. 
The `group_number` column indicates the cluster index.\n+\n++----------------------+-------------------+-----------------------+--------------------+\n+| mz | rt | intensity | group_number |\n++======================+===================+=======================+====================+\n+| 70.01060119055192 | 350.58654 | 21178.330810546875 | 5 |\n++----------------------+-------------------+-----------------------+--------------------+\n+| 70.02334120404554 | 130.175262 | 287869.5478515625 | 10 |\n++----------------------+-------------------+-----------------------+--------------------+\n+| 70.0287408273165 | 134.801352 | 60883.15185546875 | 11 |\n++----------------------+-'..b' 3 | 1 | 1 | 1 |\n++-------+--------------+--------------+---------------+----------------+---------------+---------------+-----------+------------------------+------------------------+------------------------+\n+| 2 | 70.06505677 | 70.065045 | 70.0650676 | 141.9560055 | 140.5762528 | 143.335758 | 2 | 1 | 0 | 1 |\n++-------+--------------+--------------+---------------+----------------+---------------+---------------+-----------+------------------------+------------------------+------------------------+\n+| 57 | 78.04643252 | 78.046429 | 78.0464325 | 294.0063397 | 293.9406777 | 294.072001 | 2 | 1 | 1 | 0 |\n++-------+--------------+--------------+---------------+----------------+---------------+---------------+-----------+------------------------+------------------------+------------------------+\n+| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... 
|\n++-------+--------------+--------------+---------------+----------------+---------------+---------------+-----------+------------------------+------------------------+------------------------+\n+\n+Intensity Table\n+~~~~~~~~~~~~~~~\n+This table contains the peak area for aligned features in all samples.\n+\n++-------+------------------------+------------------------+------------------------+\n+| id | 21_qc_no_dil_milliq | 29_qc_no_dil_milliq | 8_qc_no_dil_milliq |\n++=======+========================+========================+========================+\n+| 1 | 13187487.20482895 | 7957395.699119729 | 11700594.397257797 |\n++-------+------------------------+------------------------+------------------------+\n+| 2 | 2075168.6398983458 | 0 | 2574362.159289044 |\n++-------+------------------------+------------------------+------------------------+\n+| 57 | 2934524.4406785755 | 1333044.5065971944 | 0 |\n++-------+------------------------+------------------------+------------------------+\n+| ... | ... | ... | ... |\n++-------+------------------------+------------------------+------------------------+\n+\n+Retention Time Table\n+~~~~~~~~~~~~~~~~~~~~\n+This table contains the retention times for all aligned features in all samples.\n+\n++-------+------------------------+------------------------+------------------------+\n+| id | 21_qc_no_dil_milliq | 29_qc_no_dil_milliq | 8_qc_no_dil_milliq |\n++=======+========================+========================+========================+\n+| 1 | 294.09792478513236 | 294.1499853056912 | 294.0634942428341 |\n++-------+------------------------+------------------------+------------------------+\n+| 2 | 140.57625284242982 | 0 | 143.33575827589172 |\n++-------+------------------------+------------------------+------------------------+\n+| 57 | 294.07200187644435 | 293.9406777222317 | 0 |\n++-------+------------------------+------------------------+------------------------+\n+| ... | ... | ... | ... 
|\n++-------+------------------------+------------------------+------------------------+\n+</token>\n+\n+<token name="@MERGE_KNOWN_TABLES_HELP@">\n+recetox-aplcms - merge known table\n+==================================\n+\n+This tool allows merging the detected features back into the table of known features and vice versa.\n+It is used in the hybrid version of recetox-aplcms to augment the aligned feature table with the suspect peaks \n+and to augment this table with successfully detected features.\n+</token>\n+</macros>\n' |
diff -r b07fd3d7ffd0 -r abe783e0daca images/scheme.png |
Binary file images/scheme.png has changed |
diff -r b07fd3d7ffd0 -r abe783e0daca macros.xml --- a/macros.xml Thu Jun 16 10:26:58 2022 +0000 +++ b/macros.xml Mon Feb 13 10:26:59 2023 +0000 |
b'@@ -1,11 +1,9 @@\n <macros>\r\n- <token name="@TOOL_VERSION@">0.9.4</token>\r\n+ <token name="@TOOL_VERSION@">0.10.1</token>\r\n <xml name="requirements">\r\n <requirements>\r\n- <requirement type="package" version="4.1.0">r-base</requirement>\r\n- <requirement type="package" version="4.0.1">r-arrow</requirement>\r\n <requirement type="package" version="@TOOL_VERSION@">r-recetox-aplcms</requirement>\r\n- <requirement type="package" version="1.0.7">r-dplyr</requirement>\r\n+ <requirement type="package" version="2.5.2">pymzml</requirement>\r\n </requirements>\r\n </xml>\r\n \r\n@@ -31,6 +29,11 @@\n familyName="Novotn\xc3\xbd"\r\n url="https://github.com/xtracko"\r\n identifier="0000-0001-5449-3523" />\r\n+ <person\r\n+ givenName="Helge"\r\n+ familyName="Hecht"\r\n+ url="https://github.com/hechth"\r\n+ identifier="0000-0001-6744-996X" />\r\n <organization\r\n url="https://www.recetox.muni.cz/"\r\n email="GalaxyToolsDevelopmentandDeployment@space.muni.cz"\r\n@@ -38,157 +41,104 @@\n </creator>\r\n </xml>\r\n \r\n- <xml name="inputs">\r\n- <inputs>\r\n- <param name="files" type="data" format="mzdata,mzml,mzxml,netcdf" multiple="true" min="3" label="data"\r\n- help="Mass spectrometry files for peak extraction." />\r\n- <yield />\r\n- </inputs>\r\n- </xml>\r\n-\r\n- <xml name="history_db">\r\n- <param name="known_table" type="data" format="parquet" label="known_table"\r\n- help="A data table containing the known metabolite ions and previously found features. 
The table must contain these 18 columns: chemical_formula (optional), HMDB_ID (optional), KEGG_compound_ID (optional), neutral.mass (optional), ion.type (the ion form - optional), m.z (either theoretical or mean observed m/z value of previously found features), Number_profiles_processed (the total number of processed samples to build this database), Percent_found (the percentage of historically processed samples in which the feature appeared), mz_min (minimum observed m/z value), mz_max (maximum observed m/z value), RT_mean (mean observed retention time), RT_sd (standard deviation of observed retention time), RT_min (minimum observed retention time), RT_max (maximum observed retention time), int_mean.log. (mean observed log intensity), int_sd.log. (standard deviation of observed log intensity), int_min.log. (minimum observed log intensity), int_max.log. (maximum observed log intensity)." />\r\n- <section name="history_db" title="Known-Table settings">\r\n- <param name="match_tol_ppm" type="integer" optional="true" min="0" label="match_tol_ppm (optional)"\r\n- help="The ppm tolerance to match identified features to known metabolites/features." />\r\n- <param name="new_feature_min_count" type="integer" value="2" min="1" label="new_feature_min_count"\r\n- help="The minimum number of occurrences of a historically unseen (unknown) feature to add this feature into the database of known features." />\r\n- </section>\r\n+ <xml name="remove_noise_params">\r\n+ <param name="min_pres" type="float" value="0.5" label="min_pres"\r\n+ help="The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak." />\r\n+ <param name="min_run" type="float" value="12" label="min_run"\r\n+ help="The minimum length of elution time for a series of signals grouped by m/z to be considered a peak." />\r\n+ <param name="mz_tol" type="float" value="1e-05" label="mz_tol"\r\n+ help="The m/z tolerance level for the grouping of data points. 
This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machi'..b'riginal" label="Use custom RECETOX output format?" />\r\n- </section>\r\n- </xml>\r\n-\r\n- <xml name="unsupervised_outputs">\r\n- <data name="recovered_feature_sample_table" format="parquet" label="${tool.name} recovered_feature_sample_table on ${on_string}" />\r\n- <data name="aligned_feature_sample_table" format="parquet" label="${tool.name} aligned_feature_sample_table on ${on_string}" hidden="true" />\r\n- <yield />\r\n+ <xml name="bandwidth_params">\r\n+ <param name="bandwidth" type="float" value="0.5" label="bandwidth"\r\n+ help="A value between zero and one. Multiplying this value to the length of the signal along\r\n+ the time axis helps determine the bandwidth in the kernel smoother used for peak identification." />\r\n+ <param name="min_bandwidth" type="float" optional="true" label="min_bandwidth"\r\n+ help="The minimum bandwidth to use in the kernel smoother." />\r\n+ <param name="max_bandwidth" type="float" optional="true" label="max_bandwidth"\r\n+ help="The maximum bandwidth to use in the kernel smoother." 
/>\r\n </xml>\r\n \r\n <xml name="citations">\r\n@@ -197,51 +147,8 @@\n <citation type="doi">10.1186/1471-2105-11-559</citation>\r\n <citation type="doi">10.1021/pr301053d</citation>\r\n <citation type="doi">10.1093/bioinformatics/btu430</citation>\r\n+ <citation type="doi">10.1038/s41598-020-70850-0</citation>\r\n <yield />\r\n </citations>\r\n </xml>\r\n-\r\n- <token name="@HELP_hybrid@">\r\n- <![CDATA[\r\n- This is the Hybrid version of apLCMS which is incorporating the knowledge of known metabolites and historically\r\n- detected features on the same machinery to help detect and quantify lower-intensity peaks.\r\n-\r\n- CAUTION: To use such knowledge, especially historical data, you must keep using (1) the same chromatography\r\n- system (otherwise the retention time will not match), and (2) the same type of samples with similar extraction\r\n- technique, such as human serum.\r\n-\r\n- @GENERAL_HELP@\r\n- ]]>\r\n- </token>\r\n-\r\n- <token name="@HELP_unsupervised@">\r\n- <![CDATA[\r\n- This is the Unsupervised version of apLCMS which is not relying on any existing knowledge about metabolites or\r\n- any historically detected features. For such functionality please use the Hybrid version of apLCMS.\r\n-\r\n- @GENERAL_HELP@\r\n- ]]>\r\n- </token>\r\n-\r\n- <token name="@HELP_two-step-hybrid@">\r\n- <![CDATA[\r\n- This is the **Two-Step Hybrid** version of **apLCMS**. This tool is improved upon the Hybrid version by accounting for the batch\r\n- effects in multi-batch experiments. 
As in the Hybrid version, this tool incorporates the knowledge of known metabolites and\r\n- historically detected features on the same machinery to help detect and quantify lower-intensity peaks.\r\n-\r\n- **CAUTION**: To use such knowledge, especially historical data, you must keep using (1) the same chromatography\r\n- system (otherwise the retention time will not match), and (2) the same type of samples with similar extraction\r\n- technique, such as human serum.\r\n-\r\n- @GENERAL_HELP@\r\n- ]]>\r\n- </token>\r\n-\r\n- <token name="@GENERAL_HELP@">\r\n- apLCMS is a software which generates a feature table from a batch of LC/MS spectra. The m/z and retention time\r\n- tolerance levels are estimated from the data. A run-filter is used to detect peaks and remove noise.\r\n- Non-parametric statistical methods are used to find-tune peak selection and grouping. After retention time\r\n- correction, a feature table is generated by aligning peaks across spectra. For further information on apLCMS\r\n- please refer to https://mypage.cuhk.edu.cn/academics/yutianwei/apLCMS/.\r\n- </token>\r\n </macros>\r\n' |
diff -r b07fd3d7ffd0 -r abe783e0daca macros_split.xml --- a/macros_split.xml Thu Jun 16 10:26:58 2022 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
@@ -1,23 +0,0 @@
-<macros>
-    <xml name="noise_filtering_split">
-        <section name="noise_filtering" title="Noise filtering and peak detection">
-            <param name="min_pres" type="float" value="0.5"
-                   label="min_pres"
-                   help="The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak." />
-            <param name="min_run" type="float" value="12"
-                   label="min_run"
-                   help="The minimum length of elution time for a series of signals grouped by m/z to be considered a peak." />
-            <param name="mz_tol" type="float" value="1e-05"
-                   label="mz_tol"
-                   help="The m/z tolerance level for the grouping of data points. This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machine's nominal accuracy level. Divide the ppm value by 1e6. For FTMS, 1e-5 is recommended." />
-            <param name="baseline_correct" type="float" value="0" label="baseline_correct"
-                   help="After grouping the observations, the highest intensity in each group is found. If the highest is lower than this value, the entire group will be deleted. The default value is NA, in which case the program uses a percentile of the height of the noise groups. If given a value, the value will be used as the threshold, and baseline.correct.noise.percentile will be ignored." />
-            <param name="baseline_correct_noise_percentile" type="float" value="0.05"
-                   label="baseline_correct_noise_percentile"
-                   help="The percentile of signal strength of those EIC that don't pass the run filter, to be used as the baseline threshold of signal strength." />
-            <param name="intensity_weighted" type="boolean" checked="false" truevalue="TRUE" falsevalue="FALSE"
-                   label="intensity_weighted"
-                   help="Whether to weight the local density by signal intensities in initial peak detection." />
-        </section>
-    </xml>
-</macros>
\ No newline at end of file
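Note on `mz_tol`: the help text in the removed macro says the tolerance is a fraction of the m/z value, obtained by dividing the instrument's ppm accuracy by 1e6, and that the fraction multiplied by the m/z value gives the cutoff. The arithmetic can be sketched as follows. The function names `ppm_to_fraction` and `mz_cutoff` are hypothetical helpers for illustration only and are not part of recetox-aplcms.

```python
def ppm_to_fraction(ppm: float) -> float:
    """Convert an accuracy given in ppm to the fractional mz_tol value."""
    return ppm / 1e6


def mz_cutoff(mz: float, mz_tol: float) -> float:
    """Absolute m/z cutoff: the fractional tolerance scaled by the m/z value."""
    return mz * mz_tol


# 10 ppm corresponds to the 1e-5 default mentioned in the help text.
tol = ppm_to_fraction(10.0)

# Cutoff at an example m/z value (illustrative input, not taken from a real run).
cutoff = mz_cutoff(70.0106, tol)
```

Because the tolerance is relative, the absolute cutoff grows with m/z, so two points are grouped more loosely at high masses than at low masses.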
diff -r b07fd3d7ffd0 -r abe783e0daca mzml_id_getter.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/mzml_id_getter.py Mon Feb 13 10:26:59 2023 +0000 |
@@ -0,0 +1,23 @@
+#!/usr/bin/env python
+
+import argparse
+import sys
+
+from pymzml.run import Reader
+
+
+def main(argv):
+    parser = argparse.ArgumentParser(description='Get run ID from an mzML file.')
+    parser.add_argument('mzml_file', help='Path to an mzML file to get run ID from.')
+    args = parser.parse_args()
+
+    mzml = Reader(args.mzml_file)
+    id = mzml.info['run_id']
+
+    if id is not None:
+        with open("sample_name.txt", mode='x') as f:
+            f.write(id)
+
+
+if __name__ == '__main__':
+    main(sys.argv[1:])
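Note on `mzml_id_getter.py`: the script receives `argv` in `main` but calls `parser.parse_args()` without it, and it opens `sample_name.txt` with `mode='x'`, which raises `FileExistsError` if the file already exists. The sketch below isolates those two behaviours without depending on pymzml. The helpers `parse_args` and `write_sample_name` are hypothetical names introduced here and are not part of the commit.

```python
import argparse


def parse_args(argv):
    """Parse the command line from an explicit argv list (testable variant)."""
    parser = argparse.ArgumentParser(description="Get run ID from an mzML file.")
    parser.add_argument("mzml_file", help="Path to an mzML file to get run ID from.")
    return parser.parse_args(argv)


def write_sample_name(run_id, path):
    """Write the run ID to `path`; return False and write nothing if it is None.

    mode="x" creates the file exclusively, so an existing file is never
    overwritten and a second call on the same path raises FileExistsError.
    """
    if run_id is None:
        return False
    with open(path, mode="x") as f:
        f.write(run_id)
    return True
```

Passing `argv` through to `parse_args` keeps the behaviour identical on the command line while allowing the function to be called from tests with a plain list.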
diff -r b07fd3d7ffd0 -r abe783e0daca recetox_aplcms_align_features.xml --- a/recetox_aplcms_align_features.xml Thu Jun 16 10:26:58 2022 +0000 +++ b/recetox_aplcms_align_features.xml Mon Feb 13 10:26:59 2023 +0000 |
@@ -1,91 +1,63 @@ -<tool id="recetox_aplcms_align_features" name="RECETOX apLCMS - align features" version="@TOOL_VERSION@+galaxy1"> - <description>align features from LC/MS spectra across samples</description> +<tool id="recetox_aplcms_align_features" name="recetox-aplcms - align features" version="@TOOL_VERSION@+galaxy0"> + <description>align peaks across samples</description> <macros> <import>macros.xml</import> - <import>macros_split.xml</import> + <import>help.xml</import> </macros> <expand macro="creator"/> + <expand macro="requirements"/> - <expand macro="requirements"/> <command detect_errors="aggressive"><![CDATA[ - sh ${symlink_inputs} && Rscript -e 'source("${__tool_directory__}/utils.R")' -e 'source("${run_script}")' ]]></command> <configfiles> - <configfile name="symlink_inputs"> - #for $infile in $ms_files - ln -s '${infile}' '${infile.element_identifier}' - #end for - #for $infile in $corrected_files - ln -s '${infile}' '${infile.element_identifier}' - #end for - </configfile> <configfile name="run_script"><![CDATA[ - #set filenames_str = str("', '").join([str($f.element_identifier) for $f in $ms_files]) - files_list <- sort_samples_by_acquisition_number(c('$filenames_str')) - sample_names <- get_sample_name(files_list) + #set filenames = str("', '").join([str($f) for $f in $files]) + feature_tables <- load_parquet_collection(c('$filenames')) + sample_names <- unlist(lapply(feature_tables, load_sample_name)) + + validate_sample_names(sample_names) + + ordering <- order(sample_names) + feature_tables <- feature_tables[ordering] + sample_names <- sample_names[ordering] - #set corrected_files = str("', '").join([str($f.element_identifier) for $f in $corrected_files]) - corrected_features <- load_features(c('$corrected_files')) + tolerances <- load_data_from_parquet_file('$input_tolerances') - aligned <- align_features( - sample_names = sample_names, - features = corrected_features, - min.exp = $min_exp, - mz.tol = $peak_alignment.align_mz_tol, - chr.tol 
= $peak_alignment.align_chr_tol, - find.tol.max.d = 10 * $mz_tol, - max.align.mz.diff = $peak_alignment.max_align_mz_diff, - do.plot = FALSE - ) + aligned_features <- create_aligned_feature_table( + features_table = dplyr::bind_rows(feature_tables), + min_occurrence = $min_occurrence, + sample_names = sample_names, + mz_tol_relative = get_mz_tol(tolerances), + rt_tol_relative = get_rt_tol(tolerances) + ) - save_aligned_features(aligned, "$rt_cross_table", "$int_cross_table", "$tolerances") + save_aligned_features(aligned_features, '$metadata_file', '$rt_file', '$intensity_file') ]]></configfile> </configfiles> <inputs> - <param name="ms_files" type="data_collection" collection_type="list" format="mzdata,mzml,mzxml,netcdf" - label="Input data collection" help="Mass spectrometry file for peak extraction." /> - <param name="corrected_files" type="data_collection" collection_type="list" format="parquet" - label="Input corrected feature samples collection" - help="Mass spectrometry files containing corrected feature samples." /> - <expand macro="mz_tol_macro"/> - <param name="min_exp" type="integer" min="1" value="2" label="min_exp" - help="If a feature is to be included in the final feature table, it must be present in at least this number of spectra." /> - <expand macro="peak_alignment"/> + <param name="files" type="data_collection" collection_type="list" format="parquet" + label="Clustered features" help="List of tables containing clustered features." /> + <param label="Input tolerances values" name="input_tolerances" type="data" format="parquet" + help="Table containing tolerance values." /> + <param name="min_occurrence" type="integer" min="2" value="2" label="min_occurrence" + help="A feature has to show up in at least this number of profiles to be included in the final result." 
/> </inputs> <outputs> - <data name="tolerances" format="parquet" label="${tool.name} on ${on_string} (tolerances)" /> - <data name="rt_cross_table" format="parquet" label="${tool.name} on ${on_string} (rt cross table)" /> - <data name="int_cross_table" format="parquet" label="${tool.name} on ${on_string} (int cross table)" /> + <data name="metadata_file" format="parquet" label="${tool.name} on ${on_string} (metadata table)"/> + <data name="rt_file" format="parquet" label="${tool.name} on ${on_string} (rt table)"/> + <data name="intensity_file" format="parquet" label="${tool.name} on ${on_string} (intensity table)"/> </outputs> <tests> - <test> - <param name="ms_files"> - <collection type="list"> - <element name="mbr_test0.mzml" value="mbr_test0.mzml"/> - <element name="mbr_test1.mzml" value="mbr_test1.mzml"/> - <element name="mbr_test2.mzml" value="mbr_test2.mzml"/> - </collection> - </param> - <param name="corrected_files"> - <collection type="list"> - <element name="corrected_features_0.parquet" value="corrected_expected/corrected_0.parquet"/> - <element name="corrected_features_1.parquet" value="corrected_expected/corrected_1.parquet"/> - <element name="corrected_features_2.parquet" value="corrected_expected/corrected_2.parquet"/> - </collection> - </param> - <output name="tolerances" file="tolerances.parquet" ftype="parquet"/> - <output name="rt_cross_table" file="rt_cross_table.parquet" ftype="parquet"/> - <output name="int_cross_table" file="int_cross_table.parquet" ftype="parquet"/> - </test> + </tests> <help> <![CDATA[ - This is a tool which runs apLCMS alignment of features. + @ALIGN_FEATURES_HELP@ @GENERAL_HELP@ ]]> |
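Note on `min_occurrence`: the parameter help states that a feature has to show up in at least this number of profiles to be included in the final result. The sketch below illustrates that rule on the three example rows of the intensity table from `help.xml`. It assumes that an intensity of 0 marks a feature as absent in a sample, which is how the example table reads but is not confirmed by the diff. `filter_by_min_occurrence` is a hypothetical name and is not the implementation used by `create_aligned_feature_table`.

```python
def filter_by_min_occurrence(intensity_rows, min_occurrence):
    """Keep features detected (intensity > 0, an assumption) in enough samples."""
    kept = {}
    for feature_id, intensities in intensity_rows.items():
        occurrences = sum(1 for value in intensities if value > 0)
        if occurrences >= min_occurrence:
            kept[feature_id] = intensities
    return kept


# Rows copied from the example intensity table in help.xml
# (columns: 21_qc_no_dil_milliq, 29_qc_no_dil_milliq, 8_qc_no_dil_milliq).
example = {
    1: [13187487.20482895, 7957395.699119729, 11700594.397257797],
    2: [2075168.6398983458, 0, 2574362.159289044],
    57: [2934524.4406785755, 1333044.5065971944, 0],
}

kept_default = filter_by_min_occurrence(example, 2)  # tool default value
kept_strict = filter_by_min_occurrence(example, 3)
```

With the default of 2 all three example features are retained, while requiring 3 occurrences keeps only feature 1.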
diff -r b07fd3d7ffd0 -r abe783e0daca utils.R --- a/utils.R Thu Jun 16 10:26:58 2022 +0000 +++ b/utils.R Mon Feb 13 10:26:59 2023 +0000 |
b'@@ -1,149 +1,125 @@\n library(recetox.aplcms)\n \n-align_features <- function(sample_names, ...) {\n- aligned <- feature.align(...)\n- feature_names <- seq_len(nrow(aligned$pk.times))\n-\n- list(\n- mz_tolerance = as.numeric(aligned$mz.tol),\n- rt_tolerance = as.numeric(aligned$chr.tol),\n- rt_crosstab = as_feature_crosstab(feature_names, sample_names, aligned$pk.times),\n- int_crosstab = as_feature_crosstab(feature_names, sample_names, aligned$aligned.ftrs)\n- )\n+get_env_sample_name <- function() {\n+ sample_name <- Sys.getenv("SAMPLE_NAME", unset = NA)\n+ if (nchar(sample_name) == 0) {\n+ sample_name <- NA\n+ }\n+ if (is.na(sample_name)) {\n+ message("The mzML file does not contain run ID.")\n+ }\n+ return(sample_name)\n }\n \n-get_sample_name <- function(filename) {\n- tools::file_path_sans_ext(basename(filename))\n-}\n-\n-as_feature_crosstab <- function(feature_names, sample_names, data) {\n- colnames(data) <- c("mz", "rt", "mz_min", "mz_max", sample_names)\n- rownames(data) <- feature_names\n- as.data.frame(data)\n+save_sample_name <- function(df, sample_name) {\n+ attr(df, "sample_name") <- sample_name\n+ return(df)\n }\n \n-as_feature_sample_table <- function(rt_crosstab, int_crosstab) {\n- feature_names <- rownames(rt_crosstab)\n- sample_names <- colnames(rt_crosstab)[- (1:4)]\n-\n- feature_table <- data.frame(\n- feature = feature_names,\n- mz = rt_crosstab[, 1],\n- rt = rt_crosstab[, 2]\n- )\n-\n- # series of conversions to produce a table type from data.frame\n- rt_crosstab <- as.table(as.matrix(rt_crosstab[, - (1:4)]))\n- int_crosstab <- as.table(as.matrix(int_crosstab[, - (1:4)]))\n-\n- crosstab_axes <- list(feature = feature_names, sample = sample_names)\n- dimnames(rt_crosstab) <- dimnames(int_crosstab) <- crosstab_axes\n-\n- x <- as.data.frame(rt_crosstab, responseName = "sample_rt")\n- y <- as.data.frame(int_crosstab, responseName = "sample_intensity")\n-\n- data <- merge(x, y, by = c("feature", "sample"))\n- data <- merge(feature_table, data, 
by = "feature")\n- data\n+load_sample_name <- function(df) {\n+ sample_name <- attr(df, "sample_name")\n+ if (is.null(sample_name)) {\n+ return(NA)\n+ } else {\n+ return(sample_name)\n+ }\n }\n \n-load_features <- function(files) {\n- files_list <- sort_samples_by_acquisition_number(files)\n- features <- lapply(files_list, arrow::read_parquet)\n- features <- lapply(features, as.matrix)\n+save_data_as_parquet_file <- function(data, filename) {\n+ arrow::write_parquet(data, filename)\n+}\n+\n+load_data_from_parquet_file <- function(filename) {\n+ return(arrow::read_parquet(filename))\n+}\n+\n+load_parquet_collection <- function(files) {\n+ features <- lapply(files, arrow::read_parquet)\n+ features <- lapply(features, tibble::as_tibble)\n return(features)\n }\n \n-save_data_as_parquet_files <- function(data, subdir) {\n- dir.create(subdir)\n- for (i in 0:(length(data) - 1)) {\n- filename <- file.path(subdir, paste0(subdir, "_features_", i, ".parquet"))\n- arrow::write_parquet(as.data.frame(data[i + 1]), filename)\n- }\n+save_parquet_collection <- function(table, sample_names, subdir) {\n+ dir.create(subdir)\n+ for (i in seq_len(length(table$feature_tables))) {\n+ filename <- file.path(subdir, paste0(subdir, "_", sample_names[i], ".parquet"))\n+ feature_table <- as.data.frame(table$feature_tables[[i]])\n+ feature_table <- save_sample_name(feature_table, sample_names[i])\n+ arrow::write_parquet(feature_table, filename)\n+ }\n+}\n+\n+sort_by_sample_name <- function(tables, sample_names) {\n+ return(tables[order(sample_names)])\n }\n \n-save_aligned_features <- function(aligned, rt_file, int_file, tol_file) {\n- arrow::write_parquet(as.data.frame(aligned$rt_crosstab), rt_file)\n- arrow::write_parquet(as.data.frame(aligned$int_crosstab), int_file)\n-\n- mz_tolerance <- c(aligned$mz_tolerance)\n- rt_tolerance <- c(aligned$rt_tolerance)\n- arrow::write_parquet(data.frame(mz_tolerance, rt_tolerance), tol_fil'..b'filenames,\n- extracted,\n- corrected,\n- aligned,\n- mz_tol = 
1e-05,\n- mz_range = NA,\n- rt_range = NA,\n- use_observed_range = TRUE,\n- min_bandwidth = NA,\n- max_bandwidth = NA,\n- recover_min_count = 3) {\n- if (!is(cluster, "cluster")) {\n- cluster <- parallel::makeCluster(cluster)\n- on.exit(parallel::stopCluster(cluster))\n- }\n-\n- clusterExport(cluster, c("extracted", "corrected", "aligned", "recover.weaker"))\n- clusterEvalQ(cluster, library("splines"))\n+select_table_with_sample_name <- function(tables, sample_name) {\n+ sample_names <- lapply(tables, load_sample_name)\n+ index <- which(sample_names == sample_name)\n+ if (length(index) > 0) {\n+ return(tables[[index]])\n+ } else {\n+ stop(sprintf("Mismatch - sample name \'%s\' not present in %s",\n+ sample_name, paste(sample_names, collapse = ", ")))\n+ }\n+}\n \n- recovered <- parLapply(cluster, seq_along(filenames), function(i) {\n- recover.weaker(\n- loc = i,\n- filename = filenames[[i]],\n- this.f1 = extracted[[i]],\n- this.f2 = corrected[[i]],\n- pk.times = aligned$rt_crosstab,\n- aligned.ftrs = aligned$int_crosstab,\n- orig.tol = mz_tol,\n- align.mz.tol = aligned$mz_tolerance,\n- align.chr.tol = aligned$rt_tolerance,\n- mz.range = mz_range,\n- chr.range = rt_range,\n- use.observed.range = use_observed_range,\n- bandwidth = 0.5,\n- min.bw = min_bandwidth,\n- max.bw = max_bandwidth,\n- recover.min.count = recover_min_count\n- )\n- })\n+select_adjusted <- function(recovered_features) {\n+ return(recovered_features$adjusted_features)\n+}\n \n- feature_table <- aligned$rt_crosstab[, 1:4]\n- rt_crosstab <- cbind(feature_table, sapply(recovered, function(x) x$this.times))\n- int_crosstab <- cbind(feature_table, sapply(recovered, function(x) x$this.ftrs))\n-\n- feature_names <- rownames(feature_table)\n- sample_names <- colnames(aligned$rt_crosstab[, - (1:4)])\n-\n- list(\n- extracted_features = lapply(recovered, function(x) x$this.f1),\n- corrected_features = lapply(recovered, function(x) x$this.f2),\n- rt_crosstab = as_feature_crosstab(feature_names, sample_names, 
rt_crosstab),\n- int_crosstab = as_feature_crosstab(feature_names, sample_names, int_crosstab)\n- )\n+known_table_columns <- function() {\n+ c("chemical_formula", "HMDB_ID", "KEGG_compound_ID", "mass", "ion.type",\n+ "m.z", "Number_profiles_processed", "Percent_found", "mz_min", "mz_max",\n+ "RT_mean", "RT_sd", "RT_min", "RT_max", "int_mean(log)", "int_sd(log)",\n+ "int_min(log)", "int_max(log)")\n }\n \n-create_feature_sample_table <- function(features) {\n- table <- as_feature_sample_table(\n- rt_crosstab = features$rt_crosstab,\n- int_crosstab = features$int_crosstab\n- )\n- return(table)\n+save_known_table <- function(table, filename) {\n+ columns <- known_table_columns()\n+ arrow::write_parquet(table$known_table[columns], filename)\n+}\n+\n+read_known_table <- function(filename) {\n+ arrow::read_parquet(filename, col_select = known_table_columns())\n+}\n+\n+save_pairing <- function(table, filename) {\n+ df <- table$pairing %>% as_tibble() %>% setNames(c("new", "old"))\n+ arrow::write_parquet(df, filename)\n }\n+\n+join_tables_to_list <- function(metadata, rt_table, intensity_table) {\n+ features <- new("list")\n+ features$metadata <- metadata\n+ features$intensity <- intensity_table\n+ features$rt <- rt_table\n+ return(features)\n+}\n+\n+validate_sample_names <- function(sample_names) {\n+ if ((any(is.na(sample_names))) || (length(unique(sample_names)) != length(sample_names))) {\n+ stop(sprintf("Sample names absent or not unique - provided sample names: %s",\n+ paste(sample_names, collapse = ", ")))\n+ }\n+}\n' |
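Note on sample-name handling: the new `validate_sample_names` helper in `utils.R` stops when any sample name is missing or when names are not unique, and the alignment wrapper then reorders the feature tables by sample name. The same logic is sketched below in Python for illustration. This is a translation under assumptions, not the R code itself: `None` stands in for R's `NA`, and Python sorts strings by code point whereas R's `order` uses locale-dependent collation, so the resulting order may differ for some names.

```python
def validate_sample_names(sample_names):
    """Raise if any name is missing (None here, NA in R) or names are not unique."""
    missing = any(name is None for name in sample_names)
    duplicated = len(set(sample_names)) != len(sample_names)
    if missing or duplicated:
        raise ValueError(
            "Sample names absent or not unique - provided sample names: "
            + ", ".join(str(name) for name in sample_names)
        )


def sort_by_sample_name(tables, sample_names):
    """Reorder tables and names together so both follow the sorted names."""
    ordering = sorted(range(len(sample_names)), key=lambda i: sample_names[i])
    return [tables[i] for i in ordering], [sample_names[i] for i in ordering]


# Sample names taken from the example tables in help.xml; the table values
# are placeholders standing in for the per-sample feature tables.
names = ["8_qc_no_dil_milliq", "21_qc_no_dil_milliq", "29_qc_no_dil_milliq"]
tables = ["table_8", "table_21", "table_29"]

validate_sample_names(names)
sorted_tables, sorted_names = sort_by_sample_name(tables, names)
```

The names are compared as strings, so `21_...` and `29_...` sort before `8_...`, which matches the column order shown in the example tables.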